Finish strings

author: Martin Hafskjold Thoresen <martinhath@gmail.com> 2017-07-06 10:05:25 +0200
committer: Martin Hafskjold Thoresen <martinhath@gmail.com> 2017-07-06 10:05:25 +0200
commit: 223f9233078029424e2a0c9c8ba9a762fa975ba3 (patch)
tree: 4af28ccabafd83131e076b74d06375f8b286fb69 /book.tex
parent: 882dda0271204d6eee7bb53904bdc303a26db3a3 (diff)
download: ads-book-223f9233078029424e2a0c9c8ba9a762fa975ba3.tar.gz
ads-book-223f9233078029424e2a0c9c8ba9a762fa975ba3.zip
1 files changed, 280 insertions, 5 deletions
diff --git a/book.tex b/book.tex
index dd9509e..209b474 100644
--- a/book.tex
+++ b/book.tex
@@ -1,7 +1,10 @@
 \documentclass[a4paper]{book}
 \usepackage[a4paper, top=3cm,bottom=3cm, left=3cm, right=3cm%
-%, paperheight=23cm
+, paperheight=23cm
 ]{geometry}
+\usepackage{graphicx}
+\usepackage{subcaption}
+\usepackage{booktabs}
 \usepackage{amsmath}
 \usepackage{amsfonts}
 \usepackage{mathtools}
@@ -12,6 +15,10 @@
 \usepackage[backend=bibtex]{biblatex}
 \bibliography{bibtex}
 
+\usepackage{cleveref}
+\usepackage{subcaption}
+\captionsetup[subfigure]{subrefformat=simple,labelformat=simple}
+
 \usepackage{lipsum}
 
 \title{Advanced Data Structures}
@@ -20,6 +27,7 @@
 
 \newcommand{\topics}[1]{Topics: \textit{#1}}
 \newcommand{\code}[2]{\textsc{#1}$(#2)$}
+\newcommand{\algo}[1]{\textsc{#1}}
 
 \newcommand{\RMQ}{$\textsc{RMQ}$}
 \newcommand{\LCA}{$\textsc{LCA}$}
@@ -27,7 +35,7 @@
 
 
 \newenvironment{example}[0]{{\parindent=0em \textbf{Example:}}}{\vspace{1em}}
-\newenvironment{definition}[1]{{\parindent=0em \textbf{Definition} --- \emph{#1}:\\}}{\vspace{1em}}
+\newenvironment{definition}[1]{{\parindent=0em \textbf{Definition} --- \emph{#1}:}}{\vspace{1em}}
 
 \DeclareMathOperator*{\argmin}{arg\,min} % thin space, limits underneath in displays
 \DeclarePairedDelimiter{\ceil}{\lceil}{\rceil}
@@ -163,6 +171,7 @@ If the hashing functions are fully random or $\log n$-wise independent we get $O
 
 
 \chapter{Static Tree Queries}
+\label{ch:statictrees}
 
 \topics{Range Minimum Query (\code{RMQ}{i, j}), Lowest Common Ancestor (\code{LCA}{x, y}), Level Ancestor (\code{LA}{x, n}), Range Queries, Range Trees.}
 
@@ -180,7 +189,7 @@ It turns out that \LCA{} and \RMQ{} are equivalent.
 \subsection{Reduction from \RMQ{} to \LCA{}}
 
 \todo{Add tree from lecture notes}
-Build a \emph{Cartesian Tree}:
+Build a \emph{Cartesian Tree\label{sec:cartesian-tree}}:
 Walk through the array, while keeping track of the right spine of the tree.
 When inserting a new element, if the element is the largest element in the spine, insert it at the end.
 Else, we have an edge from $v$ to $w$, where the new element $a$ should be in between.
@@ -303,7 +312,11 @@ only leaves need to store the jump pointers.
 In other words, we make all queries start at leaves.
 
 \subsubsection{Step 6: Leaf Trimming}
-We define a \emph{maximally deep node} as a node with $\geq 1/4 \log n$ descendants.
+\label{sec:first-leaf-trim}
+\begin{definition}{Maximally Deep Node}
+    A node with $\geq \frac{1}{4} \log n$ descendants.
+\end{definition}
+
 Split the tree in two layers by the maximally deep nodes.
 The number of leaves in the top part is now $O(n/\log n)$, since
 \todo{eh?}
@@ -325,8 +338,267 @@ We end up with $O(1)$ time queries, using $O(n)$ space!
 
 \chapter{Strings}
 
+\topics{String search, suffix trees, suffix arrays.}
+
+We want to to \emph{String Matching} efficient: given a \emph{text} $T$ and a \emph{pattern} $P$
+we want to see if $P \in T$, or to find all occurences of $P$ in $T$.
+Both $P$ and $T$ are strings over some alphabet $\Sigma$.
+There exists linear time algorithms to do this, like \algo{Knuth-Morris-Pratt},
+\algo{Boyer-Moore}, or \algo{Karp-Rabin}.
+
+We look at a static data structure on $T$, with the goal of getting string matching in $O(|P|)$ time and $O(|T|)$ space.
+
+\section{Warmup: Library Search}
+Given a set of strings $T_1, \dots, T_k$, we query with a pattern $P$ and want to get $P$s predecessor among the $T$s.
+
+\begin{definition}{Trie}
+    A Trie sis a rooted tree with child branches labeled with letters in $\Sigma$.
+Let $T$ be the number of nodes in the trie. This is bounded by $\sum^k_{i=1} T_i$ (equality if no pair of strings share a prefix).
+\end{definition}
+\todo{Add figure}
+
+A trie can encode multiple strings, by having the edges in a path from the root to a leaf spell out the string.
+However, we need a terminal symbol $\$$ to denote the end of a string, so we can have prefixes of a string in the same trie\todo{Bad expl?}.
+If each node traverses its edges in sorted order an in-order traversal of the trie yields the strings of the trie in sorted order.
+
+\subsection{Trie Representation}
+How do we represent a trie?
+More specifically, how does each node represent its children?
+
+\subsubsection{Array}
+Nodes can store an array of length $\Sigma$, with pointers to the next nodes, or $\bot$ pointers signaling absence.
+Query for a node in $O(1)$ time, predecessor can also be $O(1)$ by having each node point to its predecessor, since the tree is static.
+The real downside of this approach is the space, since it is proportional to the alphabet size.
+
+\subsubsection{Balanced Binary Search Tree}
+We can put nodes in a BBST for each child, with a pointer to the child node.
+Each query is $O(\log\Sigma)$, and there are $O(T)$ queries to be done.
+Predecessor is simple, and space is $O(T)$.
+
+\subsubsection{Hash Table}
+We can use a hash table to map characters from $\Sigma$ to a pointer to the child node.
+This is fast, so query is $O(P)$ in total. However, we do not support predecessor queries, which we need.
+Space is $O(T)$.
+
+\subsubsection{van Emde Boas/y-fast tree}
+Use a van Emde Boas tree for the children, which allows lookup in $O(\log\log\Sigma)$ time.
+Also handles predecessor.
+See Section~\ref{sec:vEB} for details on the van Emde Boas tree.
+
+\subsubsection{Hashing + vEB}
+Have both a hash table \emph{and} a van Emde Boas tree.
+Use the hash table until we get a miss, and use the tree to find its predecessor.
+After this first miss, we only care about the largest string in the remaining subtree,
+which can be explicitly stored for the hash tables, since it is only a constant overhead per node.
+This makes the query time $O(P + \log\log\Sigma)$, since we only use the tree once.
+
+\subsubsection{Weight Balanced BST}
+Instead of having a balanced BST over each nodes children, we can weight each child with the number of leaves in its subtree.
+This ensures that every second jump in the BST either reduces the number of candidate strings \emph{in the trie} to $2/3$ of its size,
+or it finds a new trie node in the WBBST (hence we advance $P$ one letter).
+An intuition for this claim is this\todo{Add array figure}:
+we might be so lucky as to cut out $1/2$ of the leaves when leaving a node, unless there is some really heavy child in the middle
+(remember we have to retain ordering).
+But then in the next step this large child will surely be either to the far left or to the far right, which means we either follow it
+(which gets us to a new node in the trie), or we discard it, discarding a large number of leaves.
+
+We end up with a running time of $O(P + \log k)$ where $k$ is the number of leaves, and remain at $O(T)$ space.
+
+\subsubsection{Leaf Trimming}
+\begin{definition}{Maximally Deep Node}
+    A node with $\geq \Sigma$ leaf descendants.
+\end{definition}
+
+As in Section~\ref{sec:first-leaf-trim} we cut the tree in two parts, the top part and the bottom parts.
+The tree is cut at the minimal maximally deep nodes.
+Now the number of leaves in the top part is $\leq |T| / |\Sigma|$,
+and the number of branching nodes in the top part is \emph{also} $\leq |T| / |\Sigma|$.
+Now we can use arrays in the top part of the tree as well as for the cut nodes, since
+there are only $|T| / |\Sigma|$ nodes, so the $\Sigma$s cancel for the space.
+For the non-branching nodes, nodes that only have one descendant, we can simply store its decendant and its label.
+For the bottom part of the tree we can use leaf trimming, since $k=\Sigma$, by the definition of maximally deep nodes.
+
+We end up using $O(T)$ space for the top structure, and get $O(P + \log\Sigma)$ query time.
+
+Table~\ref{tab:trie-node-repr} sums up all approaches, with query time and space requirement.
+
+\begin{table}[h]
+    \centering
+    \begin{tabular}{l l l l}
+        & Node representation & Query & Space\\\midrule
+       1& Array & $O(P)$ & $O(T\Sigma)$ \\
+       2& Balances BST & $O(P \log \Sigma)$ & $O(T)$ \\
+       3& Hash Table & $O(P)$ & $O(T)$ \\
+       3.5& van Emde Boas/y-fast tree & $O(P\log\log\Sigma)$ & $O(T)$ \\
+        3.75& (3) + (3.5) & $O(P + \log\log\Sigma)$ & $O(T)$ \\
+        4 & Weight-balanced BST & $O(P + \log k)$ & $O(T)$\\
+        5 & Leaf Trimming + (1) + (4) & $O(P + \log\Sigma)$ & $O(T)$
+    \end{tabular}
+    \caption{Table over node representation, query time, and space requirement.}
+\label{tab:trie-node-repr}
+\end{table}
+
+\section{Suffix Tree}
+\begin{definition}{Compressed Trie}
+    A trie where all internal nodes are branch nodes, and the edge labels are strings over $\Sigma$ rather than characters.
+\end{definition}
+
+A Suffix Tree is a compressed trie of all suffixes of a string.
+The tree has $|T| + 1$ leaves: one leaf for each suffix, including the empty string.
+The edge labels are typically stored as indices in the string, instead of the string itself.
+Instead of appending $\$$ to each of the suffixes, which are the strings we are inserting into the tree,
+we can simply append $\$$ to the string $T$, since this will make it the last character in all of the suffixes.
+The structure takes $O(T)$ space.
+\todo{add figure}
+
+\subsection{Applications of Suffix Trees}
+Suffix trees are surprisingly useful, and with some of the results from Chapter~\ref{ch:statictrees}
+we can get some impressive results.
+
+\subsubsection{String Matching}
+A search for $P$ in the tree yields a node which leaves corresponds to all matches of $P$ in the text.
+The search is done in $O(P)$ time (trivial).
+We can then list the first $k$ occurences of $P$ in $O(k)$ time, by simply traversing the subtree.
+If we precompute the number of leaves below all nodes, we can retrieve the number of occurences of a string in
+$O(1)$ time after the inital search, making the total time $O(P)$.
+
+\subsubsection{Longest Repeating Substring}
+We are looking for the longest substring $S \subseteq T$ that is present at least twice.
+This can be done in $O(T)$ time using suffix trees, since it is the branching node of maximum ``letter depth''.
+
+\subsubsection{Longest Substring Match for $i$, $j$}
+How long is the longest common substring for $T[i..]$ and $T[j..]$?
+Find the two nodes \LCA{} in $O(1)$ time to get the common prefix.
+
+\subsubsection{Something more}
+\todo{here}
+
+\subsubsection{TODO this}
+\todo{here}
+
+\section{Suffix Arrays}
+While suffix trees are constructable in $O(T)$ time, it is difficult.
+We look at a simpler structure, the Suffix Array, which is equivalent to a suffix tree, but easier, although slower, to construct.
+
+\begin{definition}{Suffix Array}
+    An array $A$ of indices into a text $T$ such that $a_i$ gives the suffixes $T[a_i:]$ in sorted order.
+\end{definition}
+
+We let $\forall_{\sigma\in\Sigma}\ \$ < \sigma$.
+Suffix arrays are searchable in $O(P \log T)$ time using binary search.
+In addition to the suffixes, the suffix array also holds information about the difference in adjacent suffixes:
+This allows for faster searching, as we know how much of the adjacent suffix matches what we already have matched.
+
+\begin{definition}{\algo{LCP[$i$]}}
+    Longest common prefix of $i$th and $i+1$th suffix in the order they are in the suffix array.
+\end{definition}
+
+\begin{example}
+    Consider $T=\texttt{banana\$}$. The suffixes are shown in Table~\ref{tab:suffixes}, and the suffix array is shown in Table~\ref{tab:suffix-array}.
+    Note that the suffixes are not stored explicitly in the array, but only the indices $i$ and the \algo{LCP}.
+
+    \begin{figure}[h]
+        \centering
+        \begin{subfigure}{0.45\textwidth}
+            \centering
+            \begin{tabular}{l l}
+                $i$ & Suffix\\\midrule
+                0 & \texttt{banana\$}\\
+                1 & \texttt{anana\$}\\
+                2 & \texttt{nana\$}\\
+                3 & \texttt{ana\$}\\
+                4 & \texttt{na\$}\\
+                5 & \texttt{a\$}\\
+                6 & \texttt{\$}
+            \end{tabular}
+            \caption{The suffixes of $T$
+\label{tab:suffixes}}
+        \end{subfigure}
+        \begin{subfigure}{0.45\textwidth}
+            \centering
+            \begin{tabular}{l l l}
+                $i$ & Suffix              & \algo{LCP}\\\midrule
+                6 & \texttt{\$}         & 0 \\
+                5 & \texttt{a\$}        & 1 \\
+                3 & \texttt{ana\$}      & 3 \\
+                1 & \texttt{anana\$}    & 0 \\
+                0 & \texttt{banana\$}   & 0 \\
+                4 & \texttt{na\$}       & 2 \\
+                2 & \texttt{nana\$}     & $-$
+            \end{tabular}
+            \caption{The Suffix Array of $T$.
+\label{tab:suffix-array}
+            }
+        \end{subfigure}
+    \end{figure}
+\end{example}
 
-String search, suffix trees, suffix arrays.
+\subsection{Suffix Tree $\rightarrow$ Suffix Array}
+This way is simple: traverse the tree in-order.
+
+\subsection{Suffix Array $\rightarrow$ Suffix Tree}
+We make a Cartesian Tree (see Section~\ref{sec:cartesian-tree}) of the \algo{LCP} array.
+This time we put \emph{all} minimum values at the root\footnote{note that the number of 0s in the array is equal to the number of different characters in $T$}.
+The suffixes of $T$ are the leaves of the tree.
+Note that the \algo{LCP} value of the internal nodes is the letter depth of that node,
+\todo{Add figure}
+so the edge length between two internal nodes is the difference in \algo{LCP}.
+We know from Section~\ref{sec:cartesian-tree} that this is doable in linear time.
+
+\subsection{Construction}
+If we have the suffix array it is possible to construct the \algo{LCP} array in linear time\todo{ref}.
+We look at a method of constructing the suffix array from scratch in $O(T + \text{sort}(\Sigma))$ time.
+
+\begin{definition}{$\big<a, b\big>$}
+    Let $\big<a, b\big>$ denote the concatenation of the strings $a$ and $b$.
+\end{definition}
+
+\subsubsection{Step 1}
+Sort $\Sigma$. This step can be skipped if we do not need the children of the suffix tree nodes in order.
+If we skip this step, the construction is done in $O(T)$ time.
+
+\subsubsection{Step 2}
+Replace each letter in the text by its rank in the sorted array.
+By doing this we guarantee that $|\Sigma'| \leq |T|$, in case $|\Sigma|$ is really large.
+
+\subsubsection{Step 3}
+Let
+\begin{align*}
+    T_0 &= \big<(T[3i+0], T[3i+1], T[3i+2]) \text{\ for\ } i=0,1,2,\dots\big>\\
+    T_1 &= \big<(T[3i+1], T[3i+2], T[3i+3]) \text{\ for\ } i=0,1,2,\dots\big>\\
+    T_2 &= \big<(T[3i+2], T[3i+3], T[3i+4]) \text{\ for\ } i=0,1,2,\dots\big>
+\end{align*}
+If $T=\texttt{banana}$,
+$T_0 = \big<\texttt{ban}\ \texttt{ana}\big>$,
+$T_1 = \big<\texttt{ana}\ \texttt{na}\big>$, and
+$T_2 = \big<\texttt{nan}\ \texttt{a}\big>$.
+We think of the elements of $T_i$ as ``letters''.
+Our goal is to use the $T$s to construct the suffix array for $T$.
+Note that $\textsc{Suffix}(T) \cong \textsc{Suffix}(T_0) \cup \textsc{Suffix}(T_1) \cup \textsc{Suffix}(T_2)$.
+
+If $T=\texttt{banana}$,
+$\textsc{Suffix}(T_0) = \{\texttt{ban\ ana},\ \texttt{ana},\ \epsilon\}$, since we operate on the triplets as letters.
+
+\subsubsection{Step 4}
+Recurse on $\big<T_0, T_1\big>$. End up with a suffix array of the suffixes in $\big<T_0, T_1\big>$.
+Note that the number of suffixes when recursing is $O(2/3)$ of what we started with.
+
+\subsubsection{Step 5}
+Now we want to use the suffix array just obtained to radix sort the suffixes in $T_2$.
+Note that $$T_2[i..] = T[3i+2..] = \big<T[3i+2],\ T[3i+3..]\big> = \big<T[3i+2],\ T_0[i+1..]\big>.$$
+We can take off the first letter of the suffix, and get a suffix which we know the sorted order or, since it is in $T_0$, which we sorted in Step 4.
+This is like Radix sort, but with only two values, both of which can be compared in $O(1)$.
+This allows us to sort $T_2$.
+
+\subsubsection{Step 6}
+Merge $T_2$ into $T_0, T_1$.
+While this seems straight forward (merging is $O(n)$) we still need to make sure that the suffix comparisons are done in constant time.
+This is done in a similar fashion to Step 5: take off the first letter in each suffix and rewrite the suffix in terms of
+a single letter + suffixes we already know the order of.
+
+Since all operations are linear, with the exception of the recursive call, we get the following recurrence:
+$T(n) = T(\frac{2}{3} n) + O(n) = O(n)$.
+Linear time!
 
 \chapter{Temporal Structures}
 
@@ -347,6 +619,9 @@ Dynamic partial sums, dynamic connectivity.
 
 van Emde Boas, x-fast trees, y-fast trees, fusion trees.
 
+\section{van Emde Boas Tree}
+\label{sec:vEB}
+
 \chapter{Succinct Structures}
 Rank, Select
author	Martin Hafskjold Thoresen <martinhath@gmail.com>	2017-07-06 10:05:25 +0200
committer	Martin Hafskjold Thoresen <martinhath@gmail.com>	2017-07-06 10:05:25 +0200
commit	223f9233078029424e2a0c9c8ba9a762fa975ba3 (patch)
tree	4af28ccabafd83131e076b74d06375f8b286fb69 /book.tex
parent	882dda0271204d6eee7bb53904bdc303a26db3a3 (diff)
download	ads-book-223f9233078029424e2a0c9c8ba9a762fa975ba3.tar.gz ads-book-223f9233078029424e2a0c9c8ba9a762fa975ba3.zip