From 3f6946d82b76eb85386823cc5319ff32abb39c68 Mon Sep 17 00:00:00 2001
From: Martin Hafskjold Thoresen
Date: Sun, 23 Jul 2017 14:41:44 +0200
Subject: Write about concurrency

---
 book.tex | 250 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 246 insertions(+), 4 deletions(-)

(limited to 'book.tex')

diff --git a/book.tex b/book.tex
index c41beb2..7e28f2a 100644
--- a/book.tex
+++ b/book.tex
@@ -20,6 +20,9 @@
 \usepackage{subcaption}
 \captionsetup[subfigure]{subrefformat=simple,labelformat=simple}
 
+\usepackage{listings}
+\input{listingoptions.tex}
+
 \usepackage{lipsum}
 
 \title{Advanced Data Structures}
@@ -34,8 +37,7 @@
 \newcommand{\LCA}{$\textsc{LCA}$}
 \newcommand{\LA}{$\textsc{LA}$}
 
-
-\newenvironment{example}[0]{{\parindent=0em \textbf{Example:}}}{\vspace{1em}}
+\newenvironment{example}[0]{{\vspace{1em}\parindent=0em \textbf{Example:}}}{\vspace{1em}}
 \newenvironment{definition}[1]{{\parindent=0em \textbf{Definition} --- \emph{#1}:}}{\vspace{1em}}
 
 \DeclareMathOperator*{\argmin}{arg\,min} % thin space, limits underneath in displays
@@ -1635,12 +1637,252 @@
 and its parent is in position $Select(\floor{i/2})$,
 similar to in a binary heap.
 
+\chapter{Concurrency}
+
+\topics{Locks, Lock-free structures, lists, priority queues.}
+
+In this chapter we will look at primitives and data structures that handle multiple threads acting on them at the same time.
+Multithreaded programming is extremely error-prone, and implementing the primitives and data structures themselves even more so.
+For this reason, we will in this chapter write out pseudocode instead of an English description.
+
+\section{Mutual Exclusion}
+\label{sec:mutex}
+Imagine we would like to find the sum of an array.
+The sequential way to do this is simple: let $s := 0$, and for each $i$ up to $n$, read $a_i$, add it to $s$, and write the result back to $s$.
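+A sketch of this sequential loop, in the pseudocode style used throughout this chapter ($Read$ and $Write$ are written out only to make the loads and stores explicit):
+\begin{lstlisting}[title={Sequential-Sum}]
+s = 0
+for i = 1 to n
+  tmp = Read(a[i])
+  s = s + tmp
+  Write(s)
+\end{lstlisting}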
+We note that in this simple example we \emph{load} and \emph{write} variables;
+this is rarely thought of in single threaded code, but it is paramount in multithreaded code.
+We can try the same approach for the parallel version.
+Assume we have $n$ threads, so that thread $i$'s job is to add $a_i$ to $s$.
+Now, let $R_1(x)$ and $W_2(x)$ denote that thread 1 reads the variable $x$ and that thread 2 writes the variable $x$ to shared storage, respectively.
+Note that we differentiate between \emph{shared} and \emph{local} storage for the threads.
+Thread 1's instructions will be $R_1(a_i),\ R_1(s),\ W_1(s)$; note that we have omitted the addition, since it is local.
+Thread 2's instructions are similar.
+Now, we assume that the actual outcome of this computation is the same as some interleaving of the two threads' instruction streams.
+Since the interleaving is not defined, we must make sure that we obtain the correct answer no matter how the streams are interleaved.
+However, consider the following interleaving:
+\[R_1(a_i),\ R_1(s),\ R_2(a_j),\ R_2(s),\ W_1(s),\ W_2(s)\]
+Both threads read their $a$ and $s$, then do the addition, and then write out to $s$ again.
+This is a problem: $a_i$ is lost from the sum. Thread 2's local copy of $s$ is the value that
+gets written out last, and while it contains $a_j$, it does not contain $a_i$, since thread 2 read $s$ before thread 1
+had written it out.
+Note that this problem is just as real if we only have two threads, and each thread gets an interval of the array.
+
+We would like the consistency of sequential execution without giving up the speed increase of parallel execution.
+The simplest way to solve our problem above is by using a \emph{Mutex}.
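+As a sketch, thread $i$'s code becomes the following, with the read-modify-write of $s$ guarded by $lock$ and $unlock$ calls on a shared mutex $m$ (these operations are made precise below):
+\begin{lstlisting}[title={Locked-Sum}]
+tmp = Read(a[i])
+lock(m)
+  r = Read(s)
+  Write(s, r + tmp)
+unlock(m)
+\end{lstlisting}
+Only the shared accesses need to be inside the critical section; the load of $a_i$ can safely stay outside, since no other thread writes it.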
+A mutex is a synchronization primitive that has the following properties:
+let a \emph{Critical Section} be the interval of the instruction stream which is surrounded by a mutex $lock$ and $unlock$;
+a mutex ensures non-overlapping execution of critical sections, and in any non-trivial suffix of the
+linearized instruction stream (that is, the interleaved stream),
+some $lock$ or $unlock$ call to the mutex succeeds.
+
+\subsection{Peterson's Mutex Algorithm}
+We show an algorithm for making a mutex supporting two threads using only $Read$ and $Write$.
+We have three shared variables: $x$, $he\_wants$, and $she\_wants$.
+$x$ is the value we would like to use exclusively.
+In addition, we have a C-style \texttt{enum}, $turn$, which is either $his$ or $hers$.
+
+\begin{figure}[ht]
+  \centering
+  \begin{subfigure}{0.45\textwidth}
+    \begin{lstlisting}[title={Alice's code}]
+she_wants = T
+turn = his
+while he_wants and turn=his
+  spin
+{
+  access(x)
+}
+she_wants = F
+    \end{lstlisting}
+  \end{subfigure}
+  \begin{subfigure}{0.45\textwidth}
+    \begin{lstlisting}[title={Bob's code}]
+he_wants = T
+turn = hers
+while she_wants and turn=hers
+  spin
+{
+  access(x)
+}
+he_wants = F
+    \end{lstlisting}
+  \end{subfigure}
+\end{figure}
+
+In terms of $lock$ and $unlock$, the code before the curly brackets is $lock$, and the code after is $unlock$.
+We show by contradiction that this implements a mutex for two threads.
+Assume that the two threads are both inside the critical section.
+Without loss of generality, let Bob be the last writer of $turn$: $W_A(turn) \leq W_B(turn)$.
+We can clearly see from the code that $W_A(she\_wants) \leq W_A(turn)$, which implies % and $W_B(he\_wants) \leq W_B(turn)$,
+$W_A(she\_wants) \leq W_B(turn)$.
+Since $W_A(she\_wants) \leq W_A(turn) \leq W_B(turn) \leq R_B(she\_wants)$, $B$ will read $she\_wants=T$.
+In addition, since we let $B$ be the last writer of $turn$, we know that $turn=hers$,
+which means that both conditions of the \textbf{while} loop are fulfilled. $B$ will spin!
+Now $B$ will spin until $A$ has left her critical section and writes $she\_wants=F$.
+Then $B$ may enter his critical section, but by then $A$ is already done!
+There is no overlap of the critical sections, which is a contradiction.
+
+\section{Lock-Free Algorithms}
+What does it mean for an algorithm or data structure to be ``Lock-Free''?
+Lock-free code contains special instructions, the most notable of which is the $Compare\text{-}and\text{-}Swap$, or $CAS$ for short.
+$CAS$ takes three arguments: an address, an old value, and a new value. It checks whether the value stored at the
+given address is equal to the old value, and if it is, it writes the new value and returns $T$; if
+the value is not equal it returns $F$.
+This might not seem special at all, but the key is that it happens \emph{atomically}; that is, we do not have the
+problem from Section~\ref{sec:mutex} where one thread may overwrite another's data.
+
+\begin{example}
+  It is possible to make a lock using the $CAS$ instruction.
+  Note that we do not need to $CAS$ the write at line 7, since the thread leaving the critical section
+  is the only thread that will change the value of $lock$ while it is set to $T$.
+  \begin{lstlisting}[title={CAS-Lock}]
+lock = F
+while !CAS(&lock, F, T)
+  spin
+{
+  $\dots$
+}
+lock = F
+  \end{lstlisting}
+\end{example}
+
+\begin{example}
+  We make a map-fold style reducer using the $CAS$ instruction.
+  $result$ is the shared variable for the accumulated result.
+  We imagine that the threads each get a different slice of the array, which they loop over.
+  \begin{lstlisting}[title={CAS-Reduce}]
+for f in slice
+  tmp = compute(f)
+  do {
+    old = Read(result)
+    new = old + tmp
+  } while !CAS(&result, old, new)
+  \end{lstlisting}
+\end{example}
+
+We prefer lock-free to locking because lock-free algorithms do not block threads:
+we cannot rely on all threads getting a somewhat balanced running time in an interval;
+some threads may run for a while, and then stop for a while.
+If such a thread has acquired a lock before pausing for a long time, all threads
+which need that lock simply have to wait. One could imagine a scheme which allows the threads
+that actually do work to somehow be prioritized over threads that are more idle.
+In a sense this is what $CAS$ allows us to do: all threads may read at any time, and
+if they are to write, they have to check that nothing has changed from the time they
+first read the value to the time they want to write back a new value.
+
+We also note a property of the previous example: it will not block; that is,
+no execution will cause $result$ to stay the same for a long time (given that \emph{some} thread gets execution time).
+To see why, let us first assume that the range of $compute$ is non-negative.
+We note that the only reason the $CAS$ should fail is that $result$ has been increased by some other thread.
+So while it is possible that a given thread will loop indefinitely if $result$ always keeps changing, this also
+means that some \emph{other} thread is making progress in computing its values.
+
+
+\section{Treiber Stack}
+We show a lock-free stack implementation known as a \emph{Treiber Stack}.
+The idea, which is a popular one in lock-free data structures, is to build the stack out of a linked list.
+The stack is just a pointer to the head node, which is the top of the stack. A node has a value and a pointer to the next node.
+The code for $\textsc{Push}$ and $\textsc{Pop}$ is shown below.
+
+\begin{figure}[ht]
+  \centering
+  \begin{subfigure}{0.45\textwidth}
+    \begin{lstlisting}[title={Push(v)}]
+n $= \textsc{Make-Node}(v)$
+do {
+  n.next = Read(head)
+} while !CAS(&head, n.next, n)
+
+    \end{lstlisting}
+  \end{subfigure}
+  \begin{subfigure}{0.45\textwidth}
+    \begin{lstlisting}[title={Pop()}]
+curr = Read(head)
+while curr {
+  if CAS(&head, curr, curr.next)
+    break
+  curr = Read(head)
+}
+return curr
+    \end{lstlisting}
+  \end{subfigure}
+\end{figure}
+
+The idea of $\textsc{Push}$ is to first make the node that is to become the new head.
+Then we set its $next$ pointer to the current head of the stack, such that we only
+need to update the stack's head pointer. If this update fails, we update the
+$next$ pointer of our local node, and retry.
+Intuitively we understand that this operation is safe, since the stack itself
+is not changed until the $CAS$ is successful.
+
+In $\textsc{Pop}$ we read the stack's head pointer, and try to $CAS$ the head pointer
+to the next node in the stack. If we fail, someone else has either inserted or removed an element from
+the stack, so we have to read the head again.
+If we succeed, the node we removed is no longer reachable for anyone else, so we do not need to do anything special;
+we can simply return the node.
+
+
+\section{Concurrent Lists}
+Next, we want to make a concurrent list, without locks.
+We consider an implementation similar to the Treiber Stack, where we make a node on insert, and $CAS$ it into place,
+and similarly with deletions.
+
+However, with a little bit of imagination, we run into trouble.
+Consider a list with nodes $A, B, C, D$, where the first and last node are sentinel nodes, such that nodes with a value
+never have a $null$ pointer, and the list's head pointer is never a $null$ pointer.
+Assume we want to insert the node $B'$ between $B$ and $C$.
+We start out by setting $B'.next = C$.
+But then another thread comes in and deletes the node $B$ by $CAS$ing $A.next = C$.
+Then the first thread resumes its execution, and $CAS$es $B.next = B'$, which succeeds.
+The insert will look like it succeeded, but the node that precedes the new node in the list is
+in fact not reachable from the head node!
+
+Another case: we have the same list.
+Thread 1 wants to delete $B$, and thread 2 wants to delete $C$.
+Thread 2 starts out and finds the node $B$, whose $next$ pointer it wants to $CAS$.
+But then thread 1 comes in and $CAS$es $A.next = C$, which succeeds.
+Thread 2 $CAS$es $B.next = D$, and both threads return successfully.
+When we are done, the list is $A, C, D$, so we have removed only a single element, namely $B$,
+even though both delete calls were ``successful''.
+
+This illustrates why lock-free programming is hard;
+a seemingly simple task has a lot of pitfalls, and recognizing the pitfalls is not trivial.
+
+What can we do to handle these issues?
+If we allow \emph{some} locking, we can use the ``hand-over-hand'' approach with locks.
+We note that inserting a node just requires locking the nodes around the place
+where the new node should be.
+If we want to insert between $C$ and $D$, we can lock $A$ and $B$, then release $A$ and lock $C$,
+and then release $B$ and lock $D$.
+Now we know that the $next$ pointer of $C$ will not change, and we can safely insert the new node.
+Note that deletion also requires locking, but in a slightly different manner:
+on delete we need to lock the node we are deleting as well as its predecessor,
+so that we do not risk deleting a node we are using for insertion.
+
+Taking $O(n)$ locks per operation is problematic in terms of real-world performance.
+Is there some way to avoid this?
+We can search through the list and find the nodes we need, lock them, and then
+search \emph{again}, to ensure that they are still reachable from the head.
+Now we are doing a double traversal of the list, but we only require two locks.
+Whether this is faster or slower depends on the use case.
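+A sketch of this two-lock insert; $\textsc{Find-Pair}$ (the first traversal, locating the two neighboring nodes) and $reachable$ (the second traversal, checking reachability from the head) are hypothetical helpers:
+\begin{lstlisting}[title={Validated-Insert(v)}]
+retry:
+(pred, succ) = $\textsc{Find-Pair}$(v)
+lock(pred); lock(succ)
+if not reachable(pred) or pred.next != succ
+  unlock(pred); unlock(succ)
+  goto retry
+n = $\textsc{Make-Node}$(v)
+n.next = succ
+pred.next = n
+unlock(pred); unlock(succ)
+\end{lstlisting}
+Note that holding the two locks is what makes the validation meaningful: no other thread can unlink $pred$ or change $pred.next$ between the second traversal and the insertion.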
+
+Another approach is to use the least significant bits of the pointers to store information;
+these are usually 0 for alignment reasons.
+We can then ``tag'' the $next$ pointer of the node we want to use, or delete,
+in order to make other threads' $CAS$es on it fail.
+
 
-\chapter{Concurrency}
-Locks, Lock-free structures, lists, priority queues.
 
 \printbibliography%
-- 
cgit v1.2.3