We've been waiting for it too long
What could be dumber than waiting?
B. Grebenshchikov
During this lecture, you will learn
Using the select system call
Using the poll system call
Some aspects of using select/poll in multithreaded programs
Standard Asynchronous I/O
select system call
If your program is mostly I/O-bound, you can get the most important benefits of multithreading in a single-threaded program by using the select(3C) system call. On most Unix systems select is a system call, or at least it is described in section 2 of the system manual (system calls), i.e. it would be referred to as select(2); in Solaris 10, however, the corresponding manual page is in section 3C (the C standard library).
I/O devices are usually much slower than the CPU, so the CPU usually has to wait for them when performing operations on them. For this reason, in all operating systems, synchronous I/O system calls are blocking operations.
This also applies to network communications: interaction over the Internet involves high latency and, as a rule, takes place over a communication channel that is not very wide and/or overloaded.
If your program works with multiple I/O devices and/or network connections, it cannot afford to block on an operation associated with one of these devices, because while blocked it may miss the opportunity to perform I/O on another device without blocking. This problem can be solved by creating threads that work with the various devices. In previous lectures, we learned everything needed to develop such programs. However, there are other means of solving this problem.
The select(3C) system call allows you to wait for multiple devices or network connections to become ready (actually, objects of most types that can be identified by a file descriptor). When one or more of the descriptors become ready, select(3C) returns control to the program and passes sets of ready descriptors in its output parameters.
select(3C) takes sets of descriptors as parameters. In older Unix systems, sets were implemented as 1024-bit bitmasks. In modern Unix systems, and in other operating systems that implement select, sets are implemented as an opaque fd_set type on which some set-theoretic operations are defined, namely: clearing a set, including a descriptor in a set, excluding a descriptor from a set, and checking whether a descriptor is in a set. The preprocessor macros for performing these operations are described in the select(3C) manual page.
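The four set operations just mentioned can be sketched as follows; this is a minimal illustration using the standard FD_ZERO, FD_SET, FD_CLR and FD_ISSET macros, and fdset_demo is a made-up name:

```c
/* The four fd_set operations: clear, include, exclude, membership test.
 * fdset_demo is a made-up name for this illustration. */
#include <sys/select.h>

int fdset_demo(void)
{
    fd_set set;

    FD_ZERO(&set);      /* clear the set */
    FD_SET(5, &set);    /* include descriptor 5 */
    FD_SET(7, &set);    /* include descriptor 7 */
    FD_CLR(7, &set);    /* exclude descriptor 7 again */

    /* 5 is a member; 7 and 9 are not */
    return FD_ISSET(5, &set) && !FD_ISSET(7, &set) && !FD_ISSET(9, &set);
}
```

Note that only these macros are portable; as discussed below, the internal representation of fd_set varies between systems.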
On 32-bit versions of Unix SVR4, including Solaris, fd_set is still a 1024-bit mask; in 64-bit versions of SVR4 it is a 65536-bit mask. The size of the mask determines not only the maximum number of file descriptors in a set, but also the maximum descriptor value that can be placed in a set. The size of the mask in your version of the system can be determined at compile time from the value of the FD_SETSIZE preprocessor symbol. Unix file descriptor numbering starts at 0, so the maximum descriptor number is FD_SETSIZE-1.
Thus, if you use select(3C), you may need to limit the number of file descriptors available to your process. This can be done with the ulimit(1) shell command before the process starts, or with the setrlimit(2) system call while your process is already running. Of course, setrlimit(2) must be called before you start creating file descriptors.
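A minimal sketch of adjusting the descriptor limit with setrlimit(2) before any descriptors are created follows; limit_descriptors is an illustrative helper name, and the POSIX RLIMIT_NOFILE resource is assumed:

```c
/* Set the soft limit on open file descriptors to n (clamped to the hard
 * limit) before any descriptors are created.  limit_descriptors is an
 * illustrative helper name. */
#include <sys/resource.h>

int limit_descriptors(rlim_t n)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    if (n > rl.rlim_max)
        n = rl.rlim_max;        /* cannot exceed the hard limit */
    rl.rlim_cur = n;            /* new soft limit */
    return setrlimit(RLIMIT_NOFILE, &rl);
}
```

A program that intends to use select(3C) might call, say, limit_descriptors(FD_SETSIZE) at startup so that no descriptor it later creates can exceed the fd_set range.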
If you need to use more than 1024 descriptors in a 32-bit program, Solaris 10 provides a transitional API. To use it, you need to define the preprocessor symbol FD_SETSIZE with a numeric value greater than 1024 before including the header file.
In some implementations, fd_set is implemented in other ways, without bitmasks. For example, Win32 provides select as part of the so-called Winsock API, where fd_set is implemented as a dynamic array containing file descriptor values. Therefore, you should not rely on knowledge of the internal structure of the fd_set type.
Either way, changing the size of the fd_set bitmask, or the internal representation of that type, requires recompiling all programs that use select(3C). In the future, when the architectural limit of 65536 descriptors per process is raised, a new implementation of fd_set and select, and another recompilation of programs, may be required. To avoid this and to ease migration to new versions of the ABI, Sun Microsystems recommends avoiding select(3C) and using the poll(2) system call instead. The poll(2) system call is discussed later in this chapter.
The select(3C) system call has five parameters.
int nfds – a number one greater than the maximum file descriptor number in all the sets passed as parameters.
fd_set *readfds – an input parameter, the set of descriptors to be checked for readability. End of file or socket close is considered a special case of readiness to read. Regular files are always considered ready to read. Also, if you want to check whether a listening TCP socket is ready for accept(3SOCKET), it should be included in this set. On return, it is an output parameter: the set of descriptors ready to be read.
fd_set *writefds – an input parameter, the set of descriptors to be checked for readiness to write. A delayed write error is considered a special case of readiness to write. Regular files are always ready to be written. Also, if you want to test the completion of an asynchronous connect(3SOCKET) operation, the socket should be included in this set. On return, it is an output parameter: the set of descriptors ready to be written.
fd_set *errorfds – an input parameter, the set of descriptors to be checked for exceptional conditions. The definition of an exceptional condition depends on the type of file descriptor. For TCP sockets, an exceptional condition occurs when out-of-band data arrives. Regular files are always considered to be in an exceptional state. On return, it is an output parameter: the set of descriptors on which exceptional conditions occurred.
struct timeval *timeout – the timeout, a time interval specified with microsecond precision. If this parameter is NULL, select(3C) waits indefinitely; if the structure contains a zero interval, select(3C) operates in polling mode, that is, it returns control immediately, possibly with empty descriptor sets.
Instead of any of the fd_set * parameters, you can pass a null pointer. This means that we are not interested in the corresponding class of events. On normal completion (including completion by timeout), select(3C) returns the total number of ready descriptors in all the sets; on error it returns -1.
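The parameters above can be illustrated with a small sketch of select in polling mode; a pipe is used only so that the example does not depend on the state of the terminal, and poll_pipe_demo is a made-up name:

```c
/* select(3C) in polling mode: a zero timeout makes the call return
 * immediately.  An empty pipe is not ready to read; after one byte is
 * written, its read end becomes ready.  poll_pipe_demo is a made-up name. */
#include <sys/select.h>
#include <unistd.h>

int poll_pipe_demo(void)
{
    int fds[2];
    fd_set readfds;
    struct timeval tv = { 0, 0 };   /* zero interval: pure polling */
    int ready;

    if (pipe(fds) != 0)
        return -1;

    /* empty pipe: the read end is not ready, select returns 0 */
    FD_ZERO(&readfds);
    FD_SET(fds[0], &readfds);
    ready = select(fds[0] + 1, &readfds, NULL, NULL, &tv);
    if (ready != 0)
        return -1;

    /* after a write the read end becomes ready, select returns 1 */
    if (write(fds[1], "x", 1) != 1)
        return -1;
    FD_ZERO(&readfds);
    FD_SET(fds[0], &readfds);
    tv.tv_sec = 0;
    tv.tv_usec = 0;                 /* select may modify the timeout */
    ready = select(fds[0] + 1, &readfds, NULL, NULL, &tv);

    close(fds[0]);
    close(fds[1]);
    return ready == 1 && FD_ISSET(fds[0], &readfds);
}
```

Note that both the descriptor sets and the timeout structure are re-initialized before the second call, for the reasons discussed after Example 1.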
Example 1 uses select(3C) to copy data from a network connection to a terminal, and from the terminal to the network connection. The program is simplified: it assumes that writing to the terminal and to the network connection never blocks. Since both the terminal and the network connection have internal buffers, this is usually the case for small data streams.
Example 1: Two-way copying of data between a terminal and a network connection. The example is taken from W. R. Stevens, UNIX Network Programming. Instead of standard system calls, "wrapper" functions described in the file "unp.h" are used.
#include "unp.h"

void
str_cli(FILE *fp, int sockfd)
{
    int maxfdp1, stdineof;
    fd_set rset;
    char sendline[MAXLINE], recvline[MAXLINE];

    stdineof = 0;
    FD_ZERO(&rset);
    for ( ; ; ) {
        if (stdineof == 0)
            FD_SET(fileno(fp), &rset);
        FD_SET(sockfd, &rset);
        maxfdp1 = max(fileno(fp), sockfd) + 1;
        Select(maxfdp1, &rset, NULL, NULL, NULL);

        if (FD_ISSET(sockfd, &rset)) {  /* socket is readable */
            if (Readline(sockfd, recvline, MAXLINE) == 0) {
                if (stdineof == 1)
                    return;             /* normal termination */
                else
                    err_quit("str_cli: server terminated prematurely");
            }
            Fputs(recvline, stdout);
        }

        if (FD_ISSET(fileno(fp), &rset)) {  /* input is readable */
            if (Fgets(sendline, MAXLINE, fp) == NULL) {
                stdineof = 1;
                Shutdown(sockfd, SHUT_WR);  /* send FIN */
                FD_CLR(fileno(fp), &rset);
                continue;
            }
            Writen(sockfd, sendline, strlen(sendline));
        }
    }
}
Note that the program in example 1 re-creates the descriptor sets before each call to select(3C). This is necessary because select(3C) modifies its arguments on normal return.
select(3C) is considered MT-Safe, but when using it in a multithreaded program the following point must be kept in mind. select(3C) itself does not use thread-local data, so calling it from multiple threads should not, by itself, lead to problems. However, if multiple threads operate on overlapping sets of file descriptors, the following scenario is possible:
Thread 1 and thread 2 both select on descriptor s, and select reports it as ready to read to both.
Thread 1 calls read on descriptor s and drains all the data from its buffer.
Thread 2 calls read on descriptor s and blocks.
To avoid this scenario, work with file descriptors under such conditions should be protected by mutexes or other mutual exclusion primitives. It is important to emphasize that it is not select that needs to be protected, but the whole sequence of operations on a specific file descriptor, starting with including the descriptor in a set for select and ending with receiving data from that descriptor, or, more precisely, with updating the pointers into the buffer into which you read the data. If this is not done, even more exciting scenarios are possible, for example:
Thread 1 includes descriptor s in the set readfds and calls select.
select in thread 1 returns s as ready to read.
Thread 2 includes descriptor s in the set readfds and calls select.
select in thread 2 returns s as ready to read.
Thread 1 calls read on descriptor s and gets only part of the data from its buffer.
Thread 2 calls read on descriptor s, receives the data, and overwrites the data received by thread 1.
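The locking discipline described above can be sketched as follows, assuming POSIX threads; conn_t, read_some and the buffer fields are illustrative names, not part of any real API:

```c
/* The whole select-then-read sequence on a shared descriptor is performed
 * under a mutex, so two threads can never both see the same descriptor as
 * ready and interleave their reads.  conn_t and read_some are illustrative
 * names, not a real API. */
#include <pthread.h>
#include <sys/select.h>
#include <unistd.h>

typedef struct {
    int             fd;
    pthread_mutex_t lock;
    char            buf[4096];
    size_t          used;           /* bytes already read into buf */
} conn_t;

ssize_t read_some(conn_t *c)
{
    fd_set  rset;
    ssize_t n = 0;

    pthread_mutex_lock(&c->lock);   /* one thread at a time per descriptor */
    FD_ZERO(&rset);
    FD_SET(c->fd, &rset);
    if (select(c->fd + 1, &rset, NULL, NULL, NULL) > 0 &&
        FD_ISSET(c->fd, &rset)) {
        n = read(c->fd, c->buf + c->used, sizeof c->buf - c->used);
        if (n > 0)
            c->used += n;           /* update the buffer pointer too */
    }
    pthread_mutex_unlock(&c->lock);
    return n;
}
```

The mutex covers everything from FD_SET to the update of c->used, which is exactly the span of operations that the text above says must be made atomic.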
In Lecture 10, we will look at the architecture of an application in which multiple threads share a common pool of file descriptors, the so-called worker threads architecture. In this case, the threads, of course, must indicate to one another which descriptors each of them is currently working with.
From the point of view of multithreaded program development, an important disadvantage of select(3C) - or perhaps of the POSIX Threads API - is the fact that POSIX synchronization primitives are not file descriptors and cannot be used in select(3C). At the same time, in real development of multithreaded I/O programs, it would often be useful to wait, in a single operation, both for file descriptors to become ready and for other threads of one's own process to signal readiness.
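One common workaround, offered here only as a hedged sketch and not described in the text above, is the "self-pipe" technique: the waiting thread includes the read end of a pipe in readfds, and any other thread can wake it at any moment by writing a byte to the write end, turning "notify another thread" into an event select(3C) can wait for:

```c
/* The "self-pipe" technique: a thread blocked in select() on the read end
 * of a pipe can be woken by any other thread writing one byte to the
 * write end.  wake_pipe, waker and wait_for_wakeup are illustrative names. */
#include <pthread.h>
#include <sys/select.h>
#include <unistd.h>

static int wake_pipe[2];            /* [0] = read end, [1] = write end */

static void *waker(void *arg)
{
    (void)arg;
    if (write(wake_pipe[1], "w", 1) != 1)   /* wake the waiting thread */
        return NULL;
    return NULL;
}

int wait_for_wakeup(void)
{
    pthread_t t;
    fd_set    rset;
    char      c = 0;

    if (pipe(wake_pipe) != 0)
        return -1;
    pthread_create(&t, NULL, waker, NULL);

    FD_ZERO(&rset);
    FD_SET(wake_pipe[0], &rset);
    if (select(wake_pipe[0] + 1, &rset, NULL, NULL, NULL) != 1)
        return -1;
    if (read(wake_pipe[0], &c, 1) != 1)     /* drain the wakeup byte */
        return -1;

    pthread_join(t, NULL);
    close(wake_pipe[0]);
    close(wake_pipe[1]);
    return c == 'w';
}
```

In a real program the pipe would be created once and included in every readfds set alongside the ordinary descriptors, so one select waits for both kinds of events.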
As you know, there are two main I/O modes: exchange with readiness polling of the I/O device, and exchange with interrupts.
In readiness-polling mode, I/O is controlled by the central processor. The CPU sends a command to the control device to perform some action on the I/O device. The control device executes the command, translating signals understandable to the CPU into signals understandable to the I/O device. But the speed of the I/O device is much lower than that of the CPU, so the CPU has to wait a very long time for the ready signal, constantly polling the corresponding interface line for its presence or absence. It makes no sense to issue a new command without waiting for the ready signal indicating that the previous command has been executed. In readiness-polling mode, the driver that controls the exchange of data with the external device simply executes a "check for the ready signal" command in a loop; until the ready signal appears, the driver does nothing else, and CPU time is used irrationally. It is far more profitable to issue an I/O command, forget about the I/O device for a while, switch to another program, and interpret the appearance of the ready signal as an interrupt request from the I/O device. It is precisely these ready signals that serve as interrupt request signals.
Interrupt-driven exchange is inherently an asynchronous control mode. In order not to lose contact with the device, a countdown can be started within which the device must execute the command and raise an interrupt request. The maximum time within which an I/O device or its controller must raise an interrupt request is often called the established timeout. If this time elapses after the last command was issued to the device and the device has not responded, it is concluded that contact with the device has been lost and it can no longer be controlled. The user and/or task receive an appropriate diagnostic message.
Fig. 4.1. I/O control
Drivers operating in interrupt mode are complex sets of software modules and can have several sections: a start section, one or more continuation sections, and a completion section.
The start section initiates the I/O operation. It is run to turn on the I/O device or simply to initiate the next I/O operation.
The continuation section (there may be several of them if the exchange control algorithm is complex and several interrupts are needed to perform one logical operation) does the main work of data transfer; it is, in fact, the main interrupt handler. The interface in use may require several sequences of control commands to drive the I/O, while the device, as a rule, has only one interrupt signal. Therefore, after one section executes, the interrupt supervisor must transfer control to a different section on the next ready signal. This is done by changing the interrupt-handling address after each section executes; if there is only one interrupt section, it itself transfers control to the appropriate processing module.
The completion section usually turns off the I/O device, or simply terminates the operation.
An I/O operation can be performed, with respect to the program module that requested it, either synchronously or asynchronously. The meaning of these modes is the same as for the system calls discussed above: synchronous mode means that the program module suspends its work until the I/O operation completes, while in asynchronous mode the program module continues to execute, in multiprogramming mode, concurrently with the I/O operation. The difference is that an I/O operation can be initiated not only by a user process, in which case the operation is performed as part of a system call, but also by kernel code, for example, the code of the virtual memory subsystem reading in a page that is not in memory.
Fig. 7.1. Two modes of performing I/O operations
The I/O subsystem must provide its clients (user processes and kernel code) with the ability to perform both synchronous and asynchronous I/O operations, depending on the needs of the caller. I/O system calls are usually framed as synchronous procedures, because such operations take a long time and the user process or thread would have to wait for the result of the operation anyway in order to continue its work. Internal I/O calls from kernel modules are usually performed as asynchronous procedures, because kernel code needs freedom in choosing what to do after issuing an I/O request. Asynchronous procedures lead to more flexible solutions, since a synchronous call can always be built on top of an asynchronous one by adding an intermediate procedure that blocks the caller until the I/O completes. Sometimes an application process also needs to perform an asynchronous I/O operation, for example, in a microkernel architecture, when part of the code runs in user mode as an application process but performs operating system functions that require complete freedom of action even after issuing an I/O request.
The task that issued the request for the I/O operation is moved by the supervisor into the state of waiting for the completion of the requested operation. When the supervisor receives a message from the completion section that the operation has finished, it puts the task into the ready-to-run state, and the task continues its work. This situation corresponds to synchronous I/O, which is standard in most operating systems. To increase the speed of application execution, asynchronous I/O may be used where necessary.
The simplest variant of asynchronous output is so-called buffered output to an external device, in which the data from the application is transferred not directly to the I/O device but to a special system buffer. In this case, logically, the output operation is considered complete for the application immediately, and the task does not have to wait for the actual transfer of the data to the device to finish. The actual output of data from the system buffer is handled by the I/O supervisor. Naturally, a special system process allocates the buffer from the system memory area at the direction of the I/O supervisor. So, for the case under consideration, output will be asynchronous if, first, the I/O request indicated the need for buffering, and second, the I/O device allows such asynchronous operations and this is noted in its UCB (unit control block).

Asynchronous data input can be organized as well. However, for this it is necessary not only to allocate a memory area for temporary storage of the data read from the device and to associate the allocated buffer with the task that ordered the operation, but also to split the request for the I/O operation into two parts (two requests). The first request specifies an operation to read data, much as with synchronous I/O. However, the type (code) of the request is different, and the request specifies at least one additional parameter: the name (code) of the system object that the task receives in response to the request and that identifies the allocated buffer. Having received the name of the buffer (we will conditionally call this system object a buffer, although various operating systems use other terms for it, for example, a class), the task continues its work.
It is very important to emphasize here that, as a result of an asynchronous input request, the task is not moved by the I/O supervisor into the state of waiting for the I/O operation to complete, but remains running or ready to run. Some time later, after executing the code the programmer placed there, the task issues a second request to complete the I/O operation. In this second request to the same device, which naturally has a different code (or request name), the task specifies the name of the system object (the buffer for asynchronous data input) and, if the read operation has completed successfully, immediately receives the data from the system buffer. If the data has not yet been fully copied from the external device into the system buffer, the I/O supervisor moves the task into the state of waiting for the completion of the I/O operation, and then everything resembles ordinary synchronous data input.
Typically, asynchronous I/O is provided in most multiprogramming OSes, especially if the OS supports multitasking through a threading mechanism. However, even if asynchronous I/O is not explicitly available, you can implement the idea yourself by organizing an independent thread for data output.
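The idea in the previous paragraph can be sketched with POSIX threads; async_req_t, async_write and writer_thread are illustrative names. async_write copies the data into a private buffer and returns at once, while a separate thread performs the actual blocking write(2):

```c
/* Hand-made asynchronous output: async_write() copies the caller's data
 * into a private buffer and starts a thread that performs the blocking
 * write(2); the caller continues immediately.  All names are illustrative. */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

typedef struct {
    int    fd;
    char  *data;
    size_t len;
} async_req_t;

static void *writer_thread(void *arg)
{
    async_req_t *req = arg;

    write(req->fd, req->data, req->len);   /* the slow, blocking part */
    free(req->data);
    free(req);
    return NULL;
}

/* Start the output and return immediately; the caller may keep working. */
int async_write(pthread_t *t, int fd, const void *buf, size_t len)
{
    async_req_t *req = malloc(sizeof *req);

    if (req == NULL)
        return -1;
    req->fd   = fd;
    req->data = malloc(len);
    if (req->data == NULL) {
        free(req);
        return -1;
    }
    memcpy(req->data, buf, len);           /* the "system buffer" copy */
    req->len = len;
    return pthread_create(t, NULL, writer_thread, req);
}
```

Joining the returned thread with pthread_join plays the role of the "wait for completion" request from the discussion above.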
I/O hardware can be viewed as a collection of hardware "processors" that are able to work in parallel with each other and with the central processor(s). On such "processors", so-called external processes run. For example, for an external (input/output) device, an external process can be a set of operations that move the print head, advance the paper one position, change the ink color, or print characters. External processes, using the I/O equipment, interact both with one another and with ordinary "software" processes running on the central processor. What is important here is that the speed of execution of external processes differs significantly (sometimes by an order of magnitude or more) from that of ordinary ("internal") processes. For normal operation, external and internal processes must be synchronized. To smooth out the strong speed mismatch between internal and external processes, the buffering mentioned above is used. Thus, we can speak of a system of parallel, interacting processes (see Chapter 6).
Buffers are a critical resource with respect to the internal (software) and external processes that interact informationally in the course of their parallel execution. Through a buffer (or buffers), data is either sent from some process to an addressed external one (outputting data to an external device) or transferred from an external process to some software process (reading data). Introducing buffering as a means of information exchange raises the problem of managing these system buffers, which is solved by the supervisory part of the OS. The supervisor is tasked not only with allocating and freeing buffers in the system memory area, but also with synchronizing processes according to the state of the operations filling or emptying the buffers, as well as making processes wait when no free buffers are available and an I/O request requires buffering. To solve these tasks, the I/O supervisor usually uses the standard synchronization tools adopted in the given OS. Therefore, if an OS has well-developed tools for handling parallel execution of interacting applications and tasks, then, as a rule, it also implements asynchronous I/O.
An application programmer does not have to think about such things as how system programs work with device registers. The system hides the details of low-level work with devices from applications. However, the distinction between polling and interrupt I/O is reflected to some extent at the level of system functions, in the form of functions for synchronous and asynchronous I/O.
Executing a synchronous I/O function involves starting an I/O operation and waiting for that operation to complete. Only after the I/O is complete does the function return control to the calling program.
Synchronous I/O is the most familiar way for programmers to work with devices. The standard input/output routines of programming languages work in this way.
Calling an asynchronous I/O function merely starts the corresponding operation. After that, the function immediately returns control to the calling program without waiting for the operation to complete.
Consider, for example, asynchronous data input. It is clear that the program cannot access the data until it is sure the input has completed. But it is quite possible that the program can do other work in the meantime rather than sit idle waiting.
Sooner or later, the program must still start working with the entered data, but first make sure that the asynchronous operation has already completed. To do this, various operating systems provide tools that can be divided into three groups.
· Waiting for the completion of the operation. This is like the "second half" of a synchronous operation. The program first starts the operation, then performs some unrelated actions, and finally waits for the operation to finish, just as with synchronous I/O.
· Checking the completion of the operation. In this case the program does not wait, but merely checks the status of the asynchronous operation. If the I/O has not yet completed, the program may continue doing other work for a while longer.
· Assigning a completion routine. In this case, when starting the asynchronous operation, the user program gives the system the address of a user-defined procedure or function that the system should call once the operation completes. The program itself need no longer track the progress of the I/O; the system will remind it at the right moment by calling the specified function. This method is the most flexible, since the user can perform arbitrary actions in the completion routine.
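As a hedged sketch, assuming a system that provides the POSIX asynchronous I/O interface declared in <aio.h> (an extension that classic UNIX lacked), the first two methods correspond to waiting with aio_suspend() and polling with aio_error(); the third corresponds to the aio_sigevent notification field, not shown here. aio_demo is a made-up name:

```c
/* Start an asynchronous read with aio_read(), poll its status with
 * aio_error() (method 2), and if it is still in progress, block on
 * aio_suspend() (method 1).  aio_demo is a made-up name; <aio.h>
 * availability is an assumption. */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int aio_demo(const char *path)
{
    char          buf[16];
    struct aiocb  cb;
    const struct aiocb *list[1];
    int           fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;

    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof buf;
    cb.aio_offset = 0;

    if (aio_read(&cb) != 0)             /* start the operation */
        return -1;

    /* method 2: check without blocking */
    if (aio_error(&cb) == EINPROGRESS) {
        /* method 1: now wait for completion */
        list[0] = &cb;
        aio_suspend(list, 1, NULL);
    }
    close(fd);
    return (int)aio_return(&cb);        /* bytes actually read */
}
```

Method 3 would be selected by filling in cb.aio_sigevent with a notification function before calling aio_read; the program would then not call aio_suspend at all.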
In a Windows application, all three methods of completing asynchronous operations are available. Classic UNIX did not have asynchronous I/O functions (the POSIX realtime extensions later added the aio interface), but the same asynchronous effect can be achieved in another way, by running an additional process.
Asynchronous I/O can in some cases improve performance and provide additional functionality. Without such a simple form of asynchronous input as "keyboard input without waiting", numerous computer games and simulators would be impossible. At the same time, the logic of a program using asynchronous operations is more complex than that of one using synchronous operations.
What, then, is the relationship between synchronous/asynchronous operations and the I/O modes discussed in the previous section? Answer this question yourself.