diff options
| -rw-r--r-- | book.tex | 285 |
1 files changed, 280 insertions, 5 deletions
@@ -1,7 +1,10 @@ \documentclass[a4paper]{book} \usepackage[a4paper, top=3cm,bottom=3cm, left=3cm, right=3cm% -%, paperheight=23cm +, paperheight=23cm ]{geometry} +\usepackage{graphicx} +\usepackage{subcaption} +\usepackage{booktabs} \usepackage{amsmath} \usepackage{amsfonts} \usepackage{mathtools} @@ -12,6 +15,10 @@ \usepackage[backend=bibtex]{biblatex} \bibliography{bibtex} +\usepackage{cleveref} +\usepackage{subcaption} +\captionsetup[subfigure]{subrefformat=simple,labelformat=simple} + \usepackage{lipsum} \title{Advanced Data Structures} @@ -20,6 +27,7 @@ \newcommand{\topics}[1]{Topics: \textit{#1}} \newcommand{\code}[2]{\textsc{#1}$(#2)$} +\newcommand{\algo}[1]{\textsc{#1}} \newcommand{\RMQ}{$\textsc{RMQ}$} \newcommand{\LCA}{$\textsc{LCA}$} @@ -27,7 +35,7 @@ \newenvironment{example}[0]{{\parindent=0em \textbf{Example:}}}{\vspace{1em}} -\newenvironment{definition}[1]{{\parindent=0em \textbf{Definition} --- \emph{#1}:\\}}{\vspace{1em}} +\newenvironment{definition}[1]{{\parindent=0em \textbf{Definition} --- \emph{#1}:}}{\vspace{1em}} \DeclareMathOperator*{\argmin}{arg\,min} % thin space, limits underneath in displays \DeclarePairedDelimiter{\ceil}{\lceil}{\rceil} @@ -163,6 +171,7 @@ If the hashing functions are fully random or $\log n$-wise independent we get $O \chapter{Static Tree Queries} +\label{ch:statictrees} \topics{Range Minimum Query (\code{RMQ}{i, j}), Lowest Common Ancestor (\code{LCA}{x, y}), Level Ancestor (\code{LA}{x, n}), Range Queries, Range Trees.} @@ -180,7 +189,7 @@ It turns out that \LCA{} and \RMQ{} are equivalent. \subsection{Reduction from \RMQ{} to \LCA{}} \todo{Add tree from lecture notes} -Build a \emph{Cartesian Tree}: +Build a \emph{Cartesian Tree\label{sec:cartesian-tree}}: Walk through the array, while keeping track of the right spine of the tree. When inserting a new element, if the element is the largest element in the spine, insert it at the end. Else, we have an edge from $v$ to $w$, where the new element $a$ should be in between. @@ -303,7 +312,11 @@ only leaves need to store the jump pointers. In other words, we make all queries start at leaves. \subsubsection{Step 6: Leaf Trimming} -We define a \emph{maximally deep node} as a node with $\geq 1/4 \log n$ descendants. +\label{sec:first-leaf-trim} +\begin{definition}{Maximally Deep Node} + A node with $\geq \frac{1}{4} \log n$ descendants. +\end{definition} + Split the tree in two layers by the maximally deep nodes. The number of leaves in the top part is now $O(n/\log n)$, since \todo{eh?} @@ -325,8 +338,267 @@ We end up with $O(1)$ time queries, using $O(n)$ space! \chapter{Strings} +\topics{String search, suffix trees, suffix arrays.} + +We want to to \emph{String Matching} efficient: given a \emph{text} $T$ and a \emph{pattern} $P$ +we want to see if $P \in T$, or to find all occurences of $P$ in $T$. +Both $P$ and $T$ are strings over some alphabet $\Sigma$. +There exists linear time algorithms to do this, like \algo{Knuth-Morris-Pratt}, +\algo{Boyer-Moore}, or \algo{Karp-Rabin}. + +We look at a static data structure on $T$, with the goal of getting string matching in $O(|P|)$ time and $O(|T|)$ space. + +\section{Warmup: Library Search} +Given a set of strings $T_1, \dots, T_k$, we query with a pattern $P$ and want to get $P$s predecessor among the $T$s. + +\begin{definition}{Trie} + A Trie sis a rooted tree with child branches labeled with letters in $\Sigma$. +Let $T$ be the number of nodes in the trie. This is bounded by $\sum^k_{i=1} T_i$ (equality if no pair of strings share a prefix). +\end{definition} +\todo{Add figure} + +A trie can encode multiple strings, by having the edges in a path from the root to a leaf spell out the string. +However, we need a terminal symbol $\$$ to denote the end of a string, so we can have prefixes of a string in the same trie\todo{Bad expl?}. +If each node traverses its edges in sorted order an in-order traversal of the trie yields the strings of the trie in sorted order. + +\subsection{Trie Representation} +How do we represent a trie? +More specifically, how does each node represent its children? + +\subsubsection{Array} +Nodes can store an array of length $\Sigma$, with pointers to the next nodes, or $\bot$ pointers signaling absence. +Query for a node in $O(1)$ time, predecessor can also be $O(1)$ by having each node point to its predecessor, since the tree is static. +The real downside of this approach is the space, since it is proportional to the alphabet size. + +\subsubsection{Balanced Binary Search Tree} +We can put nodes in a BBST for each child, with a pointer to the child node. +Each query is $O(\log\Sigma)$, and there are $O(T)$ queries to be done. +Predecessor is simple, and space is $O(T)$. + +\subsubsection{Hash Table} +We can use a hash table to map characters from $\Sigma$ to a pointer to the child node. +This is fast, so query is $O(P)$ in total. However, we do not support predecessor queries, which we need. +Space is $O(T)$. + +\subsubsection{van Emde Boas/y-fast tree} +Use a van Emde Boas tree for the children, which allows lookup in $O(\log\log\Sigma)$ time. +Also handles predecessor. +See Section~\ref{sec:vEB} for details on the van Emde Boas tree. + +\subsubsection{Hashing + vEB} +Have both a hash table \emph{and} a van Emde Boas tree. +Use the hash table until we get a miss, and use the tree to find its predecessor. +After this first miss, we only care about the largest string in the remaining subtree, +which can be explicitly stored for the hash tables, since it is only a constant overhead per node. +This makes the query time $O(P + \log\log\Sigma)$, since we only use the tree once. + +\subsubsection{Weight Balanced BST} +Instead of having a balanced BST over each nodes children, we can weight each child with the number of leaves in its subtree. +This ensures that every second jump in the BST either reduces the number of candidate strings \emph{in the trie} to $2/3$ of its size, +or it finds a new trie node in the WBBST (hence we advance $P$ one letter). +An intuition for this claim is this\todo{Add array figure}: +we might be so lucky as to cut out $1/2$ of the leaves when leaving a node, unless there is some really heavy child in the middle +(remember we have to retain ordering). +But then in the next step this large child will surely be either to the far left or to the far right, which means we either follow it +(which gets us to a new node in the trie), or we discard it, discarding a large number of leaves. + +We end up with a running time of $O(P + \log k)$ where $k$ is the number of leaves, and remain at $O(T)$ space. + +\subsubsection{Leaf Trimming} +\begin{definition}{Maximally Deep Node} + A node with $\geq \Sigma$ leaf descendants. +\end{definition} + +As in Section~\ref{sec:first-leaf-trim} we cut the tree in two parts, the top part and the bottom parts. +The tree is cut at the minimal maximally deep nodes. +Now the number of leaves in the top part is $\leq |T| / |\Sigma|$, +and the number of branching nodes in the top part is \emph{also} $\leq |T| / |\Sigma|$. +Now we can use arrays in the top part of the tree as well as for the cut nodes, since +there are only $|T| / |\Sigma|$ nodes, so the $\Sigma$s cancel for the space. +For the non-branching nodes, nodes that only have one descendant, we can simply store its decendant and its label. +For the bottom part of the tree we can use leaf trimming, since $k=\Sigma$, by the definition of maximally deep nodes. + +We end up using $O(T)$ space for the top structure, and get $O(P + \log\Sigma)$ query time. + +Table~\ref{tab:trie-node-repr} sums up all approaches, with query time and space requirement. + +\begin{table}[h] + \centering + \begin{tabular}{l l l l} + & Node representation & Query & Space\\\midrule + 1& Array & $O(P)$ & $O(T\Sigma)$ \\ + 2& Balances BST & $O(P \log \Sigma)$ & $O(T)$ \\ + 3& Hash Table & $O(P)$ & $O(T)$ \\ + 3.5& van Emde Boas/y-fast tree & $O(P\log\log\Sigma)$ & $O(T)$ \\ + 3.75& (3) + (3.5) & $O(P + \log\log\Sigma)$ & $O(T)$ \\ + 4 & Weight-balanced BST & $O(P + \log k)$ & $O(T)$\\ + 5 & Leaf Trimming + (1) + (4) & $O(P + \log\Sigma)$ & $O(T)$ + \end{tabular} + \caption{Table over node representation, query time, and space requirement.} +\label{tab:trie-node-repr} +\end{table} + +\section{Suffix Tree} +\begin{definition}{Compressed Trie} + A trie where all internal nodes are branch nodes, and the edge labels are strings over $\Sigma$ rather than characters. +\end{definition} + +A Suffix Tree is a compressed trie of all suffixes of a string. +The tree has $|T| + 1$ leaves: one leaf for each suffix, including the empty string. +The edge labels are typically stored as indices in the string, instead of the string itself. +Instead of appending $\$$ to each of the suffixes, which are the strings we are inserting into the tree, +we can simply append $\$$ to the string $T$, since this will make it the last character in all of the suffixes. +The structure takes $O(T)$ space. +\todo{add figure} + +\subsection{Applications of Suffix Trees} +Suffix trees are surprisingly useful, and with some of the results from Chapter~\ref{ch:statictrees} +we can get some impressive results. + +\subsubsection{String Matching} +A search for $P$ in the tree yields a node which leaves corresponds to all matches of $P$ in the text. +The search is done in $O(P)$ time (trivial). +We can then list the first $k$ occurences of $P$ in $O(k)$ time, by simply traversing the subtree. +If we precompute the number of leaves below all nodes, we can retrieve the number of occurences of a string in +$O(1)$ time after the inital search, making the total time $O(P)$. + +\subsubsection{Longest Repeating Substring} +We are looking for the longest substring $S \subseteq T$ that is present at least twice. +This can be done in $O(T)$ time using suffix trees, since it is the branching node of maximum ``letter depth''. + +\subsubsection{Longest Substring Match for $i$, $j$} +How long is the longest common substring for $T[i..]$ and $T[j..]$? +Find the two nodes \LCA{} in $O(1)$ time to get the common prefix. + +\subsubsection{Something more} +\todo{here} + +\subsubsection{TODO this} +\todo{here} + +\section{Suffix Arrays} +While suffix trees are constructable in $O(T)$ time, it is difficult. +We look at a simpler structure, the Suffix Array, which is equivalent to a suffix tree, but easier, although slower, to construct. + +\begin{definition}{Suffix Array} + An array $A$ of indices into a text $T$ such that $a_i$ gives the suffixes $T[a_i:]$ in sorted order. +\end{definition} + +We let $\forall_{\sigma\in\Sigma}\ \$ < \sigma$. +Suffix arrays are searchable in $O(P \log T)$ time using binary search. +In addition to the suffixes, the suffix array also holds information about the difference in adjacent suffixes: +This allows for faster searching, as we know how much of the adjacent suffix matches what we already have matched. + +\begin{definition}{\algo{LCP[$i$]}} + Longest common prefix of $i$th and $i+1$th suffix in the order they are in the suffix array. +\end{definition} + +\begin{example} + Consider $T=\texttt{banana\$}$. The suffixes are shown in Table~\ref{tab:suffixes}, and the suffix array is shown in Table~\ref{tab:suffix-array}. + Note that the suffixes are not stored explicitly in the array, but only the indices $i$ and the \algo{LCP}. + + \begin{figure}[h] + \centering + \begin{subfigure}{0.45\textwidth} + \centering + \begin{tabular}{l l} + $i$ & Suffix\\\midrule + 0 & \texttt{banana\$}\\ + 1 & \texttt{anana\$}\\ + 2 & \texttt{nana\$}\\ + 3 & \texttt{ana\$}\\ + 4 & \texttt{na\$}\\ + 5 & \texttt{a\$}\\ + 6 & \texttt{\$} + \end{tabular} + \caption{The suffixes of $T$ +\label{tab:suffixes}} + \end{subfigure} + \begin{subfigure}{0.45\textwidth} + \centering + \begin{tabular}{l l l} + $i$ & Suffix & \algo{LCP}\\\midrule + 6 & \texttt{\$} & 0 \\ + 5 & \texttt{a\$} & 1 \\ + 3 & \texttt{ana\$} & 3 \\ + 1 & \texttt{anana\$} & 0 \\ + 0 & \texttt{banana\$} & 0 \\ + 4 & \texttt{na\$} & 2 \\ + 2 & \texttt{nana\$} & $-$ + \end{tabular} + \caption{The Suffix Array of $T$. +\label{tab:suffix-array} + } + \end{subfigure} + \end{figure} +\end{example} -String search, suffix trees, suffix arrays. +\subsection{Suffix Tree $\rightarrow$ Suffix Array} +This way is simple: traverse the tree in-order. + +\subsection{Suffix Array $\rightarrow$ Suffix Tree} +We make a Cartesian Tree (see Section~\ref{sec:cartesian-tree}) of the \algo{LCP} array. +This time we put \emph{all} minimum values at the root\footnote{note that the number of 0s in the array is equal to the number of different characters in $T$}. +The suffixes of $T$ are the leaves of the tree. +Note that the \algo{LCP} value of the internal nodes is the letter depth of that node, +\todo{Add figure} +so the edge length between two internal nodes is the difference in \algo{LCP}. +We know from Section~\ref{sec:cartesian-tree} that this is doable in linear time. + +\subsection{Construction} +If we have the suffix array it is possible to construct the \algo{LCP} array in linear time\todo{ref}. +We look at a method of constructing the suffix array from scratch in $O(T + \text{sort}(\Sigma))$ time. + +\begin{definition}{$\big<a, b\big>$} + Let $\big<a, b\big>$ denote the concatenation of the strings $a$ and $b$. +\end{definition} + +\subsubsection{Step 1} +Sort $\Sigma$. This step can be skipped if we do not need the children of the suffix tree nodes in order. +If we skip this step, the construction is done in $O(T)$ time. + +\subsubsection{Step 2} +Replace each letter in the text by its rank in the sorted array. +By doing this we guarantee that $|\Sigma'| \leq |T|$, in case $|\Sigma|$ is really large. + +\subsubsection{Step 3} +Let +\begin{align*} + T_0 &= \big<(T[3i+0], T[3i+1], T[3i+2]) \text{\ for\ } i=0,1,2,\dots\big>\\ + T_1 &= \big<(T[3i+1], T[3i+2], T[3i+3]) \text{\ for\ } i=0,1,2,\dots\big>\\ + T_2 &= \big<(T[3i+2], T[3i+3], T[3i+4]) \text{\ for\ } i=0,1,2,\dots\big> +\end{align*} +If $T=\texttt{banana}$, +$T_0 = \big<\texttt{ban}\ \texttt{ana}\big>$, +$T_1 = \big<\texttt{ana}\ \texttt{na}\big>$, and +$T_2 = \big<\texttt{nan}\ \texttt{a}\big>$. +We think of the elements of $T_i$ as ``letters''. +Our goal is to use the $T$s to construct the suffix array for $T$. +Note that $\textsc{Suffix}(T) \cong \textsc{Suffix}(T_0) \cup \textsc{Suffix}(T_1) \cup \textsc{Suffix}(T_2)$. + +If $T=\texttt{banana}$, +$\textsc{Suffix}(T_0) = \{\texttt{ban\ ana},\ \texttt{ana},\ \epsilon\}$, since we operate on the triplets as letters. + +\subsubsection{Step 4} +Recurse on $\big<T_0, T_1\big>$. End up with a suffix array of the suffixes in $\big<T_0, T_1\big>$. +Note that the number of suffixes when recursing is $O(2/3)$ of what we started with. + +\subsubsection{Step 5} +Now we want to use the suffix array just obtained to radix sort the suffixes in $T_2$. +Note that $$T_2[i..] = T[3i+2..] = \big<T[3i+2],\ T[3i+3..]\big> = \big<T[3i+2],\ T_0[i+1..]\big>.$$ +We can take off the first letter of the suffix, and get a suffix which we know the sorted order or, since it is in $T_0$, which we sorted in Step 4. +This is like Radix sort, but with only two values, both of which can be compared in $O(1)$. +This allows us to sort $T_2$. + +\subsubsection{Step 6} +Merge $T_2$ into $T_0, T_1$. +While this seems straight forward (merging is $O(n)$) we still need to make sure that the suffix comparisons are done in constant time. +This is done in a similar fashion to Step 5: take off the first letter in each suffix and rewrite the suffix in terms of +a single letter + suffixes we already know the order of. + +Since all operations are linear, with the exception of the recursive call, we get the following recurrence: +$T(n) = T(\frac{2}{3} n) + O(n) = O(n)$. +Linear time! \chapter{Temporal Structures} @@ -347,6 +619,9 @@ Dynamic partial sums, dynamic connectivity. van Emde Boas, x-fast trees, y-fast trees, fusion trees. +\section{van Emde Boas Tree} +\label{sec:vEB} + \chapter{Succinct Structures} Rank, Select |
