From 3f6946d82b76eb85386823cc5319ff32abb39c68 Mon Sep 17 00:00:00 2001
From: Martin Hafskjold Thoresen
Date: Sun, 23 Jul 2017 14:41:44 +0200
Subject: Write about concurrency

---
 book.tex | 250 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 246 insertions(+), 4 deletions(-)

(limited to 'book.tex')

diff --git a/book.tex b/book.tex
index c41beb2..7e28f2a 100644
--- a/book.tex
+++ b/book.tex
@@ -20,6 +20,9 @@
 \usepackage{subcaption}
 \captionsetup[subfigure]{subrefformat=simple,labelformat=simple}
 
+\usepackage{listings}
+\input{listingoptions.tex}
+
 \usepackage{lipsum}
 
 \title{Advanced Data Structures}
@@ -34,8 +37,7 @@
 \newcommand{\LCA}{$\textsc{LCA}$}
 \newcommand{\LA}{$\textsc{LA}$}
 
-
-\newenvironment{example}[0]{{\parindent=0em \textbf{Example:}}}{\vspace{1em}}
+\newenvironment{example}[0]{{\vspace{1em}\parindent=0em \textbf{Example:}}}{\vspace{1em}}
 \newenvironment{definition}[1]{{\parindent=0em \textbf{Definition} --- \emph{#1}:}}{\vspace{1em}}
 
 \DeclareMathOperator*{\argmin}{arg\,min} % thin space, limits underneath in displays
@@ -1635,12 +1637,252 @@
 and its parent is in position $Select(\floor{i/2})$,
 similar to in a binary heap.
 
+\chapter{Concurrency}
+
+\topics{Locks, Lock-free structures, lists, priority queues.}
+
+In this chapter we will look at primitives and data structures that handle multiple threads acting on them at the same time.
+Multithreaded programming is extremely error-prone, and implementing the primitives and data structures themselves even more so.
+For this reason, we will in this chapter write out pseudocode instead of an English description.
+
+\section{Mutual Exclusion}
+\label{sec:mutex}
+Imagine we would like to find the sum of an array.
+The sequential way to do this is simple: let $s := 0$, and for each $i$ up to $n$, read $a_i$, add it to $s$, and write the result back to $s$.
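+A sketch of this sequential loop, in the pseudocode style used throughout this chapter ($Read$ and $Write$ are written out only to make the loads and stores explicit):
+\begin{lstlisting}[title={Sequential-Sum}]
+s = 0
+for i = 1 to n
+  tmp = Read(a[i])
+  s = s + tmp
+  Write(s)
+\end{lstlisting}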
+We note that in this simple example we \emph{load} and \emph{write} variables;
+this is rarely thought of in single threaded code, but it is paramount in multithreaded code.
+We can try the same approach for the parallel version.
+Assume we have $n$ threads, so that thread $i$'s job is to add $a_i$ to $s$.
+Now, let $R_1(x)$ and $W_2(x)$ denote that thread 1 reads the variable $x$ and that thread 2 writes the variable $x$ to shared storage, respectively.
+Note that we differentiate between \emph{shared} and \emph{local} storage for the threads.
+Thread 1's instructions will be $R_1(a_i),\ R_1(s),\ W_1(s)$; note that we have omitted the addition, since it is local.
+Thread 2's instructions are similar.
+Now, we assume that the actual outcome of this computation is the same as some interleaving of the two threads' instruction streams.
+Since the interleaving is not defined, we must make sure that we obtain the correct answer no matter how the streams are interleaved.
+However, consider the following interleaving:
+\[R_1(a_i),\ R_1(s),\ R_2(a_j),\ R_2(s),\ W_1(s),\ W_2(s)\]
+Both threads read their $a$ and $s$, then do the addition, and then write out to $s$ again.
+This is a problem: $a_i$ is lost from the sum. Thread 2's local copy of $s$ is the value that
+gets written out last, and while it contains $a_j$, it does not contain $a_i$, since thread 2 read $s$ before thread 1
+had written it out.
+Note that this problem is just as real if we only have two threads, and each thread gets an interval of the array.
+
+We would like the consistency of sequential execution without giving up the speed increase of parallel execution.
+The simplest way to solve our problem above is by using a \emph{Mutex}.
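+As a sketch, thread $i$'s code becomes the following, with the read-modify-write of $s$ guarded by $lock$ and $unlock$ calls on a shared mutex $m$ (these operations are made precise below):
+\begin{lstlisting}[title={Locked-Sum}]
+tmp = Read(a[i])
+lock(m)
+  r = Read(s)
+  Write(s, r + tmp)
+unlock(m)
+\end{lstlisting}
+Only the shared accesses need to be inside the critical section; the load of $a_i$ can safely stay outside, since no other thread writes it.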
+A mutex is a synchronization primitive that has the following properties:
+let a \emph{Critical Section} be the interval of the instruction stream which is surrounded by a mutex $lock$ and $unlock$;
+a mutex ensures non-overlapping execution of critical sections, and in any non-trivial suffix of the
+linearized instruction stream (that is, the interleaved stream),
+some $lock$ or $unlock$ call to the mutex succeeds.
+
+\subsection{Peterson's Mutex Algorithm}
+We show an algorithm for making a mutex supporting two threads using only $Read$ and $Write$.
+We have three shared variables: $x$, $he\_wants$, and $she\_wants$.
+$x$ is the value we would like to use exclusively.
+In addition, we have a C-style \texttt{enum}, $turn$, which is either $his$ or $hers$.
+
+\begin{figure}[ht]
+  \centering
+  \begin{subfigure}{0.45\textwidth}
+    \begin{lstlisting}[title={Alice's code}]
+she_wants = T
+turn = his
+while he_wants and turn=his
+  spin
+{
+  access(x)
+}
+she_wants = F
+    \end{lstlisting}
+  \end{subfigure}
+  \begin{subfigure}{0.45\textwidth}
+    \begin{lstlisting}[title={Bob's code}]
+he_wants = T
+turn = hers
+while she_wants and turn=hers
+  spin
+{
+  access(x)
+}
+he_wants = F
+    \end{lstlisting}
+  \end{subfigure}
+\end{figure}
+
+In terms of $lock$ and $unlock$, the code before the curly brackets is $lock$, and the code after is $unlock$.
+We show by contradiction that this implements a mutex for two threads.
+Assume that the two threads are both inside the critical section.
+Without loss of generality, let Bob be the last writer of $turn$: $W_A(turn) \leq W_B(turn)$.
+We can clearly see from the code that $W_A(she\_wants) \leq W_A(turn)$, which implies % and $W_B(he\_wants) \leq W_B(turn)$,
+$W_A(she\_wants) \leq W_B(turn)$.
+Since $W_A(she\_wants) \leq W_A(turn) \leq W_B(turn) \leq R_B(she\_wants)$, $B$ will read $she\_wants=T$.
+In addition, since we let $B$ be the last writer of $turn$, we know that $turn=hers$,
+which means that both conditions of the \textbf{while} loop are fulfilled. $B$ will spin!
+Now $B$ will spin until $A$ has left her critical section and writes $she\_wants=F$.
+Then $B$ may enter his critical section, but by then $A$ is already done!
+There is no overlap of the critical sections, which is a contradiction.
+
+\section{Lock-Free Algorithms}
+What does it mean for an algorithm or data structure to be ``Lock-Free''?
+Lock-free code contains special instructions, the most notable of which is the $Compare\text{-}and\text{-}Swap$, or $CAS$ for short.
+$CAS$ takes three arguments: an address, an old value, and a new value. It checks whether the value stored at the
+given address is equal to the old value, and if it is, it writes the new value and returns $T$; if
+the value is not equal it returns $F$.
+This might not seem special at all, but the key is that it happens \emph{atomically}; that is, we do not have the
+problem from Section~\ref{sec:mutex} where one thread may overwrite another's data.
+
+\begin{example}
+  It is possible to make a lock using the $CAS$ instruction.
+  Note that we do not need to $CAS$ the write at line 7, since the thread leaving the critical section
+  is the only thread that will change the value of $lock$ while it is set to $T$.
+  \begin{lstlisting}[title={CAS-Lock}]
+lock = F
+while !CAS(&lock, F, T)
+  spin
+{
+  $\dots$
+}
+lock = F
+  \end{lstlisting}
+\end{example}
+
+\begin{example}
+  We make a map-fold style reducer using the $CAS$ instruction.
+  $result$ is the shared variable for the accumulated result.
+  We imagine that the threads each get a different slice of the array, which they loop over.
+  \begin{lstlisting}[title={CAS-Reduce}]
+for f in slice
+  tmp = compute(f)
+  do {
+    old = Read(result)
+    new = old + tmp
+  } while !CAS(&result, old, new)
+  \end{lstlisting}
+\end{example}
+
+We prefer lock-free to locking because lock-free algorithms do not block threads:
+we cannot rely on all threads getting a somewhat balanced running time in an interval;
+some threads may run for a while, and then stop for a while.
+If such a thread has acquired a lock before pausing for a long time, all threads
+which need that lock simply have to wait. One could imagine a scheme which allows the threads
+that actually do work to somehow be prioritized over threads that are more idle.
+In a sense this is what $CAS$ allows us to do: all threads may read at any time, and
+if they are to write, they have to check that nothing has changed from the time they
+first read the value to the time they want to write back a new value.
+
+We also note a property of the previous example: it will not block; that is,
+no execution will cause $result$ to stay the same for a long time (given that \emph{some} thread gets execution time).
+To see why, let us first assume that the range of $compute$ is non-negative.
+We note that the only reason the $CAS$ should fail is that $result$ has been increased by some other thread.
+So while it is possible that a given thread will loop indefinitely if $result$ always keeps changing, this also
+means that some \emph{other} thread is making progress in computing its values.
+
+
+\section{Treiber Stack}
+We show a lock-free stack implementation known as a \emph{Treiber Stack}.
+The idea, which is a popular one in lock-free data structures, is to build the stack out of a linked list.
+The stack is just a pointer to the head node, which is the top of the stack. A node has a value and a pointer to the next node.
+The code for $\textsc{Push}$ and $\textsc{Pop}$ is shown below.
+
+\begin{figure}[ht]
+  \centering
+  \begin{subfigure}{0.45\textwidth}
+    \begin{lstlisting}[title={Push(v)}]
+n $= \textsc{Make-Node}(v)$
+do {
+  n.next = Read(head)
+} while !CAS(&head, n.next, n)
+
+    \end{lstlisting}
+  \end{subfigure}
+  \begin{subfigure}{0.45\textwidth}
+    \begin{lstlisting}[title={Pop()}]
+curr = Read(head)
+while curr {
+  if CAS(&head, curr, curr.next)
+    break
+  curr = Read(head)
+}
+return curr
+    \end{lstlisting}
+  \end{subfigure}
+\end{figure}
+
+The idea of $\textsc{Push}$ is to first make the node that is to become the new head.
+Then we set its $next$ pointer to the current head of the stack, such that we only
+need to update the stack's head pointer. If this update fails, we update the
+$next$ pointer of our local node, and retry.
+Intuitively we understand that this operation is safe, since the stack itself
+is not changed until the $CAS$ is successful.
+
+In $\textsc{Pop}$ we read the stack's head pointer, and try to $CAS$ the head pointer
+to the next node in the stack. If we fail, someone else has either inserted or removed an element from
+the stack, so we have to read the head again.
+If we succeed, the node we removed is no longer reachable for anyone else, so we do not need to do anything special;
+we can simply return the node.
+
+
+\section{Concurrent Lists}
+Next, we want to make a concurrent list, without locks.
+We consider an implementation similar to the Treiber Stack, where we make a node on insert, and $CAS$ it into place,
+and similarly with deletions.
+
+However, with a little bit of imagination, we run into trouble.
+Consider a list with nodes $A, B, C, D$, where the first and last node are sentinel nodes, such that nodes with a value
+never have a $null$ pointer, and the list's head pointer is never a $null$ pointer.
+Assume we want to insert the node $B'$ between $B$ and $C$.
+We start out by setting $B'.next = C$.
+But then another thread comes in and deletes the node $B$ by $CAS$ing $A.next = C$.
+Then the first thread resumes its execution, and $CAS$es $B.next = B'$, which succeeds.
+The insert will look like it succeeded, but the node that precedes the new node in the list is
+in fact not reachable from the head node!
+
+Another case: we have the same list.
+Thread 1 wants to delete $B$, and thread 2 wants to delete $C$.
+Thread 2 starts out and finds the node $B$, whose $next$ pointer it wants to $CAS$.
+But then thread 1 comes in and $CAS$es $A.next = C$, which succeeds.
+Thread 2 $CAS$es $B.next = D$, and both threads return successfully.
+When we are done, the list is $A, C, D$, so we have removed only a single element, namely $B$,
+even though both delete calls were ``successful''.
+
+This illustrates why lock-free programming is hard;
+a seemingly simple task has a lot of pitfalls, and recognizing the pitfalls is not trivial.
+
+What can we do to handle these issues?
+If we allow \emph{some} locking, we can use the ``hand-over-hand'' approach with locks.
+We note that inserting a node just requires locking the nodes around the place
+where the new node should be.
+If we want to insert between $C$ and $D$, we can lock $A$ and $B$, then release $A$ and lock $C$,
+and then release $B$ and lock $D$.
+Now we know that the $next$ pointer of $C$ will not change, and we can safely insert the new node.
+Note that deletion also requires locking, but in a slightly different manner:
+on delete we need to lock the node we are deleting as well as its predecessor,
+so that we do not risk deleting a node we are using for insertion.
+
+Taking $O(n)$ locks per operation is problematic in terms of real-world performance.
+Is there some way to avoid this?
+We can search through the list and find the nodes we need, lock them, and then
+search \emph{again}, to ensure that they are still reachable from the head.
+Now we are doing a double traversal of the list, but we only require two locks.
+Whether this is faster or slower depends on the use case.
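+A sketch of this two-lock insert; $\textsc{Find-Pair}$ (the first traversal, locating the two neighboring nodes) and $reachable$ (the second traversal, checking reachability from the head) are hypothetical helpers:
+\begin{lstlisting}[title={Validated-Insert(v)}]
+retry:
+(pred, succ) = $\textsc{Find-Pair}$(v)
+lock(pred); lock(succ)
+if not reachable(pred) or pred.next != succ
+  unlock(pred); unlock(succ)
+  goto retry
+n = $\textsc{Make-Node}$(v)
+n.next = succ
+pred.next = n
+unlock(pred); unlock(succ)
+\end{lstlisting}
+Note that holding the two locks is what makes the validation meaningful: no other thread can unlink $pred$ or change $pred.next$ between the second traversal and the insertion.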
+
+Another approach is to use the least significant bits of the pointers to store information;
+these are usually 0 for alignment reasons.
+We can then ``tag'' the $next$ pointer of the node we want to use, or delete,
+in order to make other threads' $CAS$es on it fail.
+
 
-\chapter{Concurrency}
-Locks, Lock-free structures, lists, priority queues.
 
 \printbibliography%
-- 
cgit v1.2.3