authorMartin Hafskjold Thoresen <martinhath@gmail.com>2017-07-22 19:39:16 +0200
committerMartin Hafskjold Thoresen <martinhath@gmail.com>2017-07-22 19:39:16 +0200
commit56d35440f352d840f8e14eb2b6c70f1ed4308e69 (patch)
tree5cc8076cebc848593d89e2f5f8147863d58fddd1
parent931388cf620d1ab347a3d7ffdd1063b35f8e6dc0 (diff)
Write succinct
 book.tex | 89 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 87 insertions(+), 2 deletions(-)
diff --git a/book.tex b/book.tex
index c25fdc2..c41beb2 100644
--- a/book.tex
+++ b/book.tex
@@ -1300,7 +1300,7 @@ Note that vEB with hashing or y-fast trees are faster than fusion trees if $w$ i
\section{van Emde Boas Tree}%
\label{sec:vEB}
-The general idea of the van Emde Boas Tree is to try to obtain the time recurrence $T(n) = T(\sqrt(n)) + O(1)$.
+The general idea of the van Emde Boas Tree is to try to obtain the time recurrence $T(n) = T(\sqrt{n}) + O(1)$.
We can do this by splitting the universe into $\sqrt{u}$ clusters, each of size $\sqrt{u}$, and recurse on the cluster.
For universes whose size is a power of 2, this means splitting the bits of the number in half.
Consider a word $a = \big<c, i\big>$;
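The splitting of a word into its high and low halves can be sketched as follows (a minimal illustration, assuming the word length $w$ is even; the function names are made up):

```python
def split(a: int, w: int) -> tuple[int, int]:
    """Split a w-bit word a = <c, i> into the cluster index c (high w/2
    bits) and the index within the cluster i (low w/2 bits)."""
    half = w // 2
    return a >> half, a & ((1 << half) - 1)

def combine(c: int, i: int, w: int) -> int:
    """Inverse of split: reassemble the word from c and i."""
    return (c << (w // 2)) | i

# split(0b10110011, 8) → (0b1011, 0b0011)
```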
@@ -1551,7 +1551,92 @@ At last, we combine the two indices (cluster index and internal index to the clu
\chapter{Succinct Structures}
-Rank, Select
+\topics{Rank, Select}
+
+In this chapter we will look at ways to store data using as little extra space as possible.
+These structures are often static, and the one we will look at, the bit vector, will indeed be static.
+
+We divide the space hierarchy for data structures into three levels:
+Implicit data structures, which use the information-theoretical optimum space plus a constant number of bits;
+Succinct data structures, which use the optimum space plus $o(OPT)$ extra bits; and
+Compact data structures, which use $O(OPT)$ space.
+
+\section{Bit Vector}
+A Bit Vector is an array of bits.
+We want to support the following operations:
+$Rank(i) = $ the number of 1s in the interval $[0, i]$, and
+$Select(i) = $ the position of the $i$th 1.
+In a way, select is the inverse operation of rank.
+We also want the queries to be fast, and within the memory requirements we have set.
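On a concrete bit vector, the two operations behave as follows (a naive $O(n)$ reference implementation; the constant-time versions are the subject of the next subsections):

```python
bits = [1, 0, 1, 1, 0, 1]

def rank(i: int) -> int:
    """Number of 1s in bits[0..i] (inclusive)."""
    return sum(bits[: i + 1])

def select(i: int) -> int:
    """Position of the i-th 1 (1-indexed)."""
    ones = 0
    for pos, b in enumerate(bits):
        ones += b
        if ones == i:
            return pos
    raise ValueError("fewer than i ones in the vector")

# rank(3) == 3: bits[0..3] = 1,0,1,1 contains three 1s.
# select(3) == 3: the third 1 sits at position 3.
```

Note that `rank(select(i)) == i`, reflecting that select is the inverse of rank.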
+
+\subsection{Rank}
+We would like to precompute queries on chunks of size $\frac{1}{2}\log n$.
+Since this is a bit vector, the total number of different such chunks is $2^{\frac{1}{2} \log n} = \sqrt{n}$.
+The number of different possible queries on such a chunk is $\frac{1}{2} \log n$,
+and each answer requires $\log\log n$ bits to represent.
+This means that a lookup table of all possibilities takes
+$\sqrt{n}\cdot \frac{1}{2} \log n\cdot \log\log n = o(n)$ bits.
+
+Next we divide the vector into chunks of size $\log^2 n$, and store the cumulative rank at each chunk boundary.
+Since there are $n / \log^2 n$ chunks and we need $\log n$ bits to store each rank, we use $n / \log n = o(n)$ extra bits.
+We now need to handle queries inside the chunks of size $\log^2 n$, which are slightly too big for the lookup table.
+We use indirection a second time!
+
+Split each chunk into subchunks of size $\frac{1}{2} \log n$.
+This time, in between the subchunks we store the \emph{local} cumulative rank, instead of the global.
+This is what distinguishes this approach from simply dividing the entire vector into chunks of size $\frac{1}{2} \log n$.
+Since the size of a chunk is $\log^2 n$, the local rank needs only $\log \log n$ bits,
+and there are $\frac{\log^2 n}{\frac{1}{2} \log n} = 2 \log n$ subchunks,
+so we use $2 \log n \cdot \log\log n$ bits for each chunk.
+Since there are $\frac{n}{\log^2 n}$ chunks, the total extra space we use is:
+
+\[ \frac{n}{\log^2 n} \cdot 2 \log n \cdot \log \log n = \frac{2n \cdot \log \log n}{\log n} = o(n)\]
+
+On a query, we take the rank of the chunk, plus the relative rank of the subchunk within the chunk,
+plus the relative rank of the element within the subchunk, which is precomputed in the lookup table.
+
+It is possible to get Rank with $O(n/\log^k n)$ extra bits for any $k=O(1)$.
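The two levels of indirection can be sketched as follows (a simplified illustration; the exact constants, and the final in-subchunk step done here by direct counting rather than by the lookup table, are assumptions of the sketch):

```python
import math

class RankBitVector:
    """Two-level rank structure: global cumulative ranks at chunk
    boundaries, chunk-local cumulative ranks at subchunk boundaries."""

    def __init__(self, bits):
        self.bits = bits
        n = max(len(bits), 4)
        self.sub = max(int(math.log2(n) / 2), 1)  # subchunk size ~ (1/2) log n
        self.chunk = 4 * self.sub * self.sub      # chunk size ~ log^2 n
        self.chunk_rank = []  # global rank at each chunk boundary
        self.sub_rank = []    # chunk-local rank at each subchunk boundary
        total = local = 0
        for i, b in enumerate(bits):
            if i % self.chunk == 0:
                self.chunk_rank.append(total)
                local = 0                         # local rank resets per chunk
            if i % self.sub == 0:
                self.sub_rank.append(local)
            total += b
            local += b

    def rank(self, i):
        """Number of 1s in bits[0..i]; the final scan over one subchunk
        stands in for the O(1) lookup-table step."""
        r = self.chunk_rank[i // self.chunk] + self.sub_rank[i // self.sub]
        start = (i // self.sub) * self.sub
        return r + sum(self.bits[start : i + 1])
```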
+
+\subsection{Select}
+This time we store an array of indices of every $\log n \cdot \log \log n$th 1-bit.
+That is, instead of dividing the vector into chunks of equal size, we divide it into chunks with the same number of 1s in it.
+With this we can find which chunk a query falls in, in constant time, but we still need to search inside the chunk.
+In addition, the chunk sizes now vary.
+Let $r$ be the size of a given chunk.
+We handle different sizes of $r$ separately.
+
+If $r \geq {(\log n \log\log n)}^2$ we can afford to store a lookup table of the answers:
+there are $\log n\log\log n$ queries for the chunk (one for each 1), and each answer, an index, takes $\log n$ bits,
+so the cost for \emph{one} such chunk is $\log^2 n \log\log n$ bits.
+Since the size of the chunk is at least ${(\log n \log\log n)}^2$ there cannot be more than
+$\frac{n}{{(\log n \log\log n)}^2}$ of them, so the total size for this scheme in the entire bit vector is
+$\log^2 n \log\log n \cdot \frac{n}{{(\log n \log\log n)}^2} = \frac{n}{\log\log n}$ bits.
+
+If $r \leq {(\log n \log\log n)}^2$ we store the relative index of every ${(\log\log n)}^2$th 1-bit within the chunk.
+There are at most $\frac{n}{{(\log\log n)}^2}$ such subchunks, and each stored index needs only $O(\log\log n)$ bits, since it is relative,
+which makes the total cost $O\big(\frac{n}{\log\log n}\big)$ bits.
+Again we consider different sizes, this time of the subchunks:
+if $r \geq {(\log\log n)}^4$ we store all the answers, as relative indices:
+\[O\left(\frac{n}{{(\log\log n)}^4} \cdot {(\log\log n)}^2 \log\log n\right) = O\left(\frac{n}{\log\log n}\right) \text{ bits.}\]
+If not, the subchunks are so small that we can use the lookup table.
+
+We end up using $O\big(\frac{n}{\log\log n}\big)$ extra bits, while still answering queries in constant time.
+As with Rank, it is possible to get Select with $O(n/\log^k n)$ extra bits for any $k=O(1)$.
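The outermost sampling idea can be sketched as follows (a simplified illustration: the scan inside a chunk stands in for the further levels of sampling and lookup tables, and the choice of $t$ is an assumption of the sketch):

```python
import math

class SelectBitVector:
    """Store the position of every t-th 1-bit, t ~ log n * log log n,
    so chunks contain equally many 1s rather than equally many bits."""

    def __init__(self, bits):
        self.bits = bits
        n = max(len(bits), 16)
        lg = math.log2(n)
        self.t = max(int(lg * math.log2(lg)), 1)
        self.samples = []  # samples[j] = position of 1 number j*t (0-indexed)
        ones = 0
        for pos, b in enumerate(bits):
            if b:
                if ones % self.t == 0:
                    self.samples.append(pos)
                ones += 1

    def select(self, i):
        """Position of the i-th 1 (0-indexed); the scan stands in for
        the constant-time in-chunk machinery of the full construction."""
        pos = self.samples[i // self.t]
        remaining = i % self.t  # 1s still to skip past inside the chunk
        while True:
            if self.bits[pos]:
                if remaining == 0:
                    return pos
                remaining -= 1
            pos += 1
```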
+
+\section{Navigating Binary Trees}
+We now look at a succinct representation of a binary tree.
+Consider the tree encoding obtained by a level-order traversal where each node outputs one bit per child:
+a 1 if the child is present and a 0 if it is not\todo{figure}. This representation uses $2n$ bits.
+If we insert a $1$ in front of the encoding, we can navigate the tree using Rank and Select from the previous section,
+treating the encoding as a bit vector.
+
+For a node at position $i$ in the bit vector, its left and right children are at positions $2\,Rank(i)$ and $2\,Rank(i) + 1$,
+and its parent is at position $Select(\floor{i/2})$,
+just as in a binary heap.
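On a small example tree, the navigation formulas work out as follows (a sketch with naive, 1-indexed Rank and Select standing in for the constant-time structures above; the example tree is made up):

```python
# Tree:     r          Level-order child bits, preceded by a 1 for r:
#          / \         r: 11 (both children), a: 01 (right child only),
#         a   b        b: 00, c: 00.
#          \
#           c
# Positions are 1-indexed; index 0 is unused padding.
bits = [0,  1,  1, 1,  0, 1,  0, 0,  0, 0]

def rank(i):    # number of 1s in positions 1..i
    return sum(bits[1 : i + 1])

def select(j):  # position of the j-th 1
    ones = 0
    for pos in range(1, len(bits)):
        ones += bits[pos]
        if ones == j:
            return pos

def left_child(i):  return 2 * rank(i)
def right_child(i): return 2 * rank(i) + 1
def parent(i):      return select(i // 2)

# r is at position 1, a at 2, b at 3, c at 5 (a's right child).
# left_child(1) == 2 (a); right_child(2) == 5 (c); parent(5) == 2 (a).
# A child is present exactly when its bit is 1: bits[left_child(2)] == 0,
# so a has no left child.
```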
+
+
+
+
\chapter{Concurrency}