Interprocess Communication Concepts
Big Data applications are distributed by nature. To understand how they tick, you need to grasp the concepts that distributed applications are built on. One such concept is interprocess communication, or IPC, which covers the ways processes can communicate with each other.
There are many kinds of IPC used in practice, including sockets, pipes, mailslots, memory-mapped files, and so on. But almost all of them are based on two concepts: shared memory and message passing.
Shared Memory
Normally, the operating system (OS) isolates the address spaces of processes so that they cannot affect each other. The downside of this isolation is that if one process needs to pass some data to another, the data must be copied, which can be quite an expensive operation for larger data sets.
To alleviate the issue, many operating systems implement shared memory IPC. Shared memory allows two or more processes to access the same memory region, so that they can exchange data without copying it.
It is usually implemented via memory-mapped files: the shared region is mapped into the address space of each process that needs access to it. Shared memory leaves the resolution of concurrency issues to the processes involved, which often use semaphores to synchronise access to the shared data.
Shared memory IPC is often preferred when processes need to exchange large amounts of data.
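
Here is a minimal sketch of the idea in C, assuming a Linux system with POSIX shared mappings and process-shared semaphores (the payload string is made up for the example). A parent and a forked child share one anonymous mapping; a semaphore placed inside the region tells the reader when the data is ready, and the data itself is never copied between address spaces.

```c
/* Minimal shared-memory sketch (POSIX, Linux).
 * Build: cc shm_demo.c -o shm_demo -pthread
 */
#include <semaphore.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

struct region {
    sem_t ready;      /* posted by the writer once data is in place */
    char  data[256];  /* payload exchanged without any copying      */
};

int main(void) {
    /* MAP_SHARED | MAP_ANONYMOUS: parent and child see the same pages. */
    struct region *shm = mmap(NULL, sizeof *shm, PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shm == MAP_FAILED) { perror("mmap"); return 1; }

    sem_init(&shm->ready, /*pshared=*/1, /*value=*/0);

    if (fork() == 0) {            /* child: the reader          */
        sem_wait(&shm->ready);    /* block until writer posts   */
        printf("child read: %s\n", shm->data);
        return 0;
    }

    /* parent: the writer */
    strcpy(shm->data, "hello from shared memory"); /* example payload */
    sem_post(&shm->ready);        /* let the child proceed      */
    wait(NULL);
    sem_destroy(&shm->ready);
    munmap(shm, sizeof *shm);
    return 0;
}
```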
Message Passing
Another major IPC concept implemented by most popular operating systems is message passing. In message passing, processes exchange data by passing messages using only two operations: send and receive.
The message passing concept looks simple, but it requires a number of design choices to be made. Different combinations of these choices have given rise to the many message passing implementations found in practice.
Direct vs indirect
One of the choices an IPC designer has to make is whether message passing is direct or indirect. In direct message passing, the receiver's identity is known and a message is sent directly to the receiving process. We can say that there is a link between the two processes exchanging data, and there cannot be more than one link between any two processes.
A serious disadvantage of direct message passing is limited modularity: changing the identity of a process requires amending every sender and receiver that holds a link to it.
In indirect message passing, messages are sent to a mailbox (or port), which is bound to a receiving process. It differs from direct message passing because the same mailbox could later be bound to another process. A sender doesn't know and doesn't care which process will actually receive its message. Additionally, multiple processes can send messages to the same mailbox, thus allowing multi-process links.
Indirect message passing gives a lot of flexibility to the application designer. For example, two processes can communicate via multiple mailboxes, so that there are multiple links between the same pair of processes.
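
POSIX message queues are one concrete implementation of this mailbox idea. The sketch below assumes a Linux system (older glibc needs -lrt) and uses an arbitrary queue name, "/demo_mailbox", invented for the example; the sender addresses the mailbox by name and neither knows nor cares which process reads from it.

```c
/* Indirect message passing via a named mailbox (POSIX message queues).
 * Build: cc mq_demo.c -o mq_demo -lrt
 */
#include <fcntl.h>
#include <mqueue.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAILBOX "/demo_mailbox"   /* arbitrary name for this example */

int main(void) {
    struct mq_attr attr = { .mq_maxmsg = 8, .mq_msgsize = 128 };
    mqd_t mq = mq_open(MAILBOX, O_CREAT | O_RDWR, 0600, &attr);
    if (mq == (mqd_t)-1) { perror("mq_open"); return 1; }

    if (fork() == 0) {                       /* receiver process */
        char buf[128];
        ssize_t n = mq_receive(mq, buf, sizeof buf, NULL); /* blocks */
        if (n >= 0) printf("received: %.*s\n", (int)n, buf);
        return 0;
    }

    /* sender process: addresses the mailbox, not the receiver */
    const char *msg = "hello via mailbox";
    mq_send(mq, msg, strlen(msg), 0);
    wait(NULL);
    mq_close(mq);
    mq_unlink(MAILBOX);                      /* remove the name */
    return 0;
}
```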
Blocking vs non-blocking
An IPC designer also has to decide whether the send and receive operations are blocking (synchronised) or non-blocking. A send operation is blocking if the sender has to wait until its message is received by the receiver. Similarly, a receive operation is blocking if the receiver has to wait until a message arrives.
The designer of a message passing IPC has to choose whether either, both, or neither of the send and receive operations is synchronised.
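
The same choice shows up in POSIX APIs. The sketch below uses a pipe (a byte-stream flavour of IPC, picked here just to keep the example small): with O_NONBLOCK set on the read end, a receive returns immediately when nothing is waiting instead of suspending the caller.

```c
/* Contrasting blocking and non-blocking receives on a pipe (POSIX). */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fds[2];
    if (pipe(fds) != 0) { perror("pipe"); return 1; }

    /* Switch the read end to non-blocking mode. */
    fcntl(fds[0], F_SETFL, O_NONBLOCK);

    char buf[64];
    ssize_t n = read(fds[0], buf, sizeof buf);  /* nothing sent yet */
    if (n < 0 && errno == EAGAIN)
        printf("non-blocking receive: no message, returned at once\n");

    write(fds[1], "ping", 4);                   /* now send one     */
    n = read(fds[0], buf, sizeof buf);          /* succeeds         */
    printf("received %zd bytes\n", n);

    /* Without O_NONBLOCK, the first read() above would have
     * suspended the caller until a message arrived. */
    close(fds[0]);
    close(fds[1]);
    return 0;
}
```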
Buffering
Another decision to be made concerns the size of the receiver's queue. There are three choices available, contrasted in the sketch after this list:
- Zero-capacity queue (no queue). The sender has to wait until the receiver is ready to take the message.
- Bounded queue. A message can be queued as long as the queue is not full; otherwise, the sender has to wait.
- Unbounded queue. The sender never has to wait. The designer should be careful with this choice: physical resources are limited, and a queue that grows without bound can eventually exhaust memory.
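
POSIX message queues implement the bounded option: mq_maxmsg caps the buffer. A minimal sketch, again assuming Linux and an arbitrary queue name "/demo_bounded" invented for the example, shows a non-blocking sender discovering that the queue is full; a blocking sender would simply wait at that point.

```c
/* A bounded queue with POSIX message queues.
 * Build: cc bounded_demo.c -o bounded_demo -lrt
 */
#include <errno.h>
#include <fcntl.h>
#include <mqueue.h>
#include <stdio.h>
#include <string.h>

#define QNAME "/demo_bounded"   /* arbitrary name for this example */

int main(void) {
    /* Bound the queue at two messages. */
    struct mq_attr attr = { .mq_maxmsg = 2, .mq_msgsize = 32 };
    mqd_t mq = mq_open(QNAME, O_CREAT | O_WRONLY | O_NONBLOCK,
                       0600, &attr);
    if (mq == (mqd_t)-1) { perror("mq_open"); return 1; }

    /* The first two sends fit into the bounded queue... */
    mq_send(mq, "one", 3, 0);
    mq_send(mq, "two", 3, 0);

    /* ...the third finds it full; the non-blocking send fails
     * where a blocking sender would wait. */
    if (mq_send(mq, "three", 5, 0) < 0 && errno == EAGAIN)
        printf("queue full: sender would have to wait\n");

    mq_close(mq);
    mq_unlink(QNAME);
    return 0;
}
```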
The bottom line
Message passing IPC is in general more convenient, because it requires less hassle with low-level synchronisation primitives.
Shared memory is commonly used when processes need to transfer large amounts of data. However, the efficiency of shared memory decreases as the number of processor cores and cache levels grows, due to the increasing number of cache misses.