From 56d35440f352d840f8e14eb2b6c70f1ed4308e69 Mon Sep 17 00:00:00 2001
From: Martin Hafskjold Thoresen
Date: Sat, 22 Jul 2017 19:39:16 +0200
Subject: Write succinct

---
 book.tex | 89 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 87 insertions(+), 2 deletions(-)

diff --git a/book.tex b/book.tex
index c25fdc2..c41beb2 100644
--- a/book.tex
+++ b/book.tex
@@ -1300,7 +1300,7 @@ Note that vEB with hashing or y-fast trees are faster than fusion trees if $w$ i
 \section{van Emde Boas Tree}%
 \label{sec:vEB}
 
-The general idea of the van Emde Boas Tree is to try to obtain the time recurrence $T(n) = T(\sqrt(n)) + O(1)$.
+The general idea of the van Emde Boas Tree is to try to obtain the time recurrence $T(n) = T(\sqrt{n}) + O(1)$.
 We can do this by splitting the universe into $\sqrt{u}$ clusters, each of size $\sqrt{u}$, and recursing on the clusters.
 For universes that are a power of 2, this means splitting the bits of the number in half.
 Consider a word $a = \big\langle c, i \big\rangle$;
@@ -1551,7 +1551,92 @@ At last, we combine the two indices (cluster index and internal index to the clu
 
 \chapter{Succinct Structures}
 
-Rank, Select
+\topics{Rank, Select}
+
+In this chapter we look at ways to store data using as little extra space as possible.
+These structures are often static, and the one we will look at, the bit vector, is indeed static.
+
+We divide the space hierarchy for data structures into three levels:
+implicit data structures, which use the information-theoretical optimum space plus a constant number of extra bits;
+succinct data structures, which use the optimum space plus $o(OPT)$ extra bits; and
+compact data structures, which use $O(OPT)$ space.
+
+\section{Bit Vector}
+A bit vector is an array of bits.
+We want to support the following operations:
+$Rank(i) = $ the number of 1s in the interval $[0, i]$, and
+$Select(i) = $ the position of the $i$th 1.
+In a way, select is the inverse operation of rank.
+We also want the queries to be fast, and within the memory requirements we have set.
+
+\subsection{Rank}
+We would like to precompute queries on chunks of size $\frac{1}{2}\log n$.
+Since this is a bit vector, the total number of different such chunks is $2^{\frac{1}{2}\log n} = \sqrt{n}$.
+The number of different possible queries on such a chunk is $\frac{1}{2} \log n$,
+and each answer requires $\log\log n$ bits to represent.
+This means that if we make a lookup table of all possibilities, it takes
+$\sqrt{n}\cdot \frac{1}{2} \log n\cdot \log\log n = o(n)$ bits.
+
+Next we divide the vector into chunks of size $\log^2 n$, and store the cumulative rank at each chunk boundary.
+Since there are $n / \log^2 n$ chunks and we need $\log n$ bits to store each rank, we use $n / \log n = o(n)$ extra bits.
+We now need to handle the chunks of size $\log^2 n$, which are slightly too big for the lookup table.
+We use indirection a second time!
+
+Split each chunk into subchunks of size $\frac{1}{2} \log n$.
+This time, in between the subchunks we store the \emph{local} cumulative rank, instead of the global one.
+This is what distinguishes this approach from simply dividing the entire vector into chunks of size $\frac{1}{2} \log n$.
+Now, since the size of a chunk is $\log^2 n$, a local rank needs only $\log \log n$ bits,
+and there are $\frac{\log^2 n}{\frac{1}{2} \log n} = 2 \log n$ subchunks,
+so we use $2 \log n \cdot \log\log n$ bits for each chunk.
+Since there are $\frac{n}{\log^2 n}$ chunks, the total extra space we use is
+
+\[ \frac{n}{\log^2 n} \cdot 2 \log n \cdot \log \log n = \frac{2n \cdot \log \log n}{\log n} = o(n).\]
+
+On a query, we take the rank of the chunk plus the relative rank of the subchunk within the chunk,
+and then the relative rank of the element within the subchunk, which is precomputed in the lookup table.
+
+It is possible to get Rank in $O(n/\log^k n)$ extra bits for any $k=O(1)$.
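The two-level scheme above can be sketched in a few lines of Python. This is an illustrative sketch only: the class name and fields are made up, the chunk and subchunk sizes are derived from $\log n$ but not tuned, and the final per-subchunk lookup table is replaced by a direct scan of the $O(\log n)$ remaining bits, which plays the same role.

```python
import math

class RankBitVector:
    """Sketch of the two-level rank structure (hypothetical names/sizes)."""

    def __init__(self, bits):
        self.bits = bits
        n = max(len(bits), 2)
        # subchunk size ~ (1/2) log n; chunk size ~ log^2 n, a multiple of it
        self.sub = max(1, int(math.log2(n)) // 2)
        self.chunk = 4 * self.sub * self.sub
        self.chunk_rank = []  # global cumulative rank at each chunk boundary
        self.sub_rank = []    # local cumulative rank at each subchunk boundary
        total = 0
        for i in range(0, len(bits), self.chunk):
            self.chunk_rank.append(total)
            local = 0
            for j in range(i, min(i + self.chunk, len(bits)), self.sub):
                self.sub_rank.append(local)
                local += sum(bits[j:j + self.sub])
            total += local

    def rank(self, i):
        """Number of 1s in bits[0..i]."""
        c, s = i // self.chunk, i // self.sub
        # chunk rank + local subchunk rank; the last step stands in for
        # the lookup table on the subchunk's bit pattern
        return self.chunk_rank[c] + self.sub_rank[s] \
            + sum(self.bits[s * self.sub : i + 1])
```

A query touches one entry per level plus one subchunk, matching the constant-time bound once the scan is replaced by the table lookup.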
+
+\subsection{Select}
+This time we store an array with the index of every $(\log n \cdot \log \log n)$th 1-bit.
+That is, instead of dividing the vector into chunks of equal size, we divide it into chunks containing the same number of 1s.
+With this we can find which chunk a query falls in in constant time, but we still need to search inside the chunk.
+In addition, the chunk sizes now differ.
+Let $r$ be the size of a given chunk.
+We consider different sizes of $r$.
+
+If $r \geq {(\log n \log\log n)}^2$, we can afford to store a lookup table of the answers:
+there are $\log n\log\log n$ queries for the chunk (one for each 1), and each answer, an index, takes $\log n$ bits,
+so the cost for \emph{one} such chunk is $\log^2 n \log\log n$ bits.
+Since the size of such a chunk is at least ${(\log n \log\log n)}^2$, there cannot be more than
+$\frac{n}{{(\log n \log\log n)}^2}$ of them, so the total cost of this scheme over the entire bit vector is
+$\log^2 n \log\log n \cdot \frac{n}{{(\log n \log\log n)}^2} = \frac{n}{\log\log n}$ bits.
+
+If $r < {(\log n \log\log n)}^2$, we store inside the chunk the relative index of every ${(\log\log n)}^2$th 1-bit.
+We get at most $\frac{n}{{(\log\log n)}^2}$ subchunks, and each needs only $O(\log\log n)$ bits, since the index is relative,
+which makes the total cost $O(\frac{n}{\log\log n})$ bits.
+Again we consider different sizes, this time of the subchunks:
+if a subchunk has size at least ${(\log\log n)}^4$, we store all the answers as relative indices:
+\[O\left(\frac{n}{{(\log\log n)}^4} \cdot {(\log\log n)}^2 \log\log n\right) = O\left(\frac{n}{\log\log n}\right) \text{ bits.}\]
+If not, the subchunks are so small that we can use the lookup table.
+
+We end up using $O(\frac{n}{\log\log n})$ extra bits, while still answering in constant time.
+As with Rank, it is possible to get Select in $O(n/\log^k n)$ extra bits for any $k=O(1)$.
+
+\section{Navigating Binary Trees}
+We look at a succinct representation of a binary tree.
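The core idea of Select, sampling every $k$th 1-bit so a query can jump directly to the right chunk, can be sketched as follows. This is a simplified illustration: $k$ is a small constant rather than $\log n \cdot \log\log n$, the class name is made up, and a local scan stands in for the nested relative indices and lookup tables described above.

```python
class SelectBitVector:
    """Sketch of the select structure: sampled 1-bit positions (hypothetical)."""

    def __init__(self, bits, k=4):
        self.bits = bits
        self.k = k
        self.samples = []  # position of the (j*k)-th 1-bit, 0-indexed
        ones = 0
        for pos, b in enumerate(bits):
            if b:
                if ones % k == 0:
                    self.samples.append(pos)
                ones += 1

    def select(self, i):
        """Position of the i-th 1-bit, counting from i = 0."""
        # the samples array gives the start of the right chunk in O(1)
        start = self.samples[i // self.k]
        seen = (i // self.k) * self.k
        # this scan is what the relative indices / lookup tables replace
        for pos in range(start, len(self.bits)):
            if self.bits[pos]:
                if seen == i:
                    return pos
                seen += 1
        raise IndexError("fewer than i+1 ones in the vector")
```

The scan is bounded by the chunk size; the case analysis on $r$ in the text is exactly about making that remaining work constant time within the space budget.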
+Consider the tree encoding obtained by a level-order (breadth-first) traversal where each node outputs one bit per child:
+a 1 if the child is present and a 0 if it is not\todo{figure}. This representation clearly uses $2n$ bits.
+Now, if we prepend a $1$ to the encoding, we can navigate the tree using Rank and Select from the previous section,
+treating the encoding as a bit string.
+
+For a node at position $i$ in the bit vector, its left and right children are at positions $2Rank(i)$ and $2Rank(i) + 1$,
+and its parent is at position $Select(\floor{i/2})$,
+similar to a binary heap.
+
+
+
 \chapter{Concurrency}
--
cgit v1.2.3
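The encoding and the navigation formulas above can be checked with a small sketch. Assumptions: nodes are represented as hypothetical `(left, right)` tuples, positions are 1-indexed as the heap-style formulas require, and naive $O(n)$ rank/select stand in for the succinct structures from the previous section.

```python
from collections import deque

def encode_level_order(root):
    """Level-order child-presence bits, with the extra leading 1 prepended."""
    bits = [1]
    q = deque([root])
    while q:
        node = q.popleft()
        for child in node:  # (left, right)
            bits.append(1 if child is not None else 0)
            if child is not None:
                q.append(child)
    return bits

# naive rank/select, 1-indexed positions (bits[p-1] is position p)
def rank(bits, i):
    return sum(bits[:i])          # number of 1s in positions 1..i

def select(bits, i):
    seen = 0
    for p, b in enumerate(bits, start=1):
        seen += b
        if b and seen == i:
            return p               # position of the i-th 1

def left_child(bits, i):  return 2 * rank(bits, i)
def right_child(bits, i): return 2 * rank(bits, i) + 1
def parent(bits, i):      return select(bits, i // 2)
```

A child exists exactly when the bit at the computed position is 1, so the same formulas double as existence tests.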