Merge pull request #266 from hyanwong/fill-out-topology

benjeffery · web-flow · commit 1f60c73b2b7c · 2023-11-21T22:26:14.000Z
Fill out topology counting tutorial
diff --git a/_toc.yml b/_toc.yml
@@ -12,6 +12,7 @@ parts:
   chapters:
   - file: analysing_tree_sequences
   - file: analysing_trees
+  - file: incremental_algorithms
   - file: counting_topologies
   - file: parallelization
 - caption: Further tskit tutorials
diff --git a/counting_topologies.md b/counting_topologies.md
@@ -16,13 +16,309 @@ kernelspec:
 
 (sec_counting_topologies)=
 
-# _Counting topologies_
-% remove underscores in title when tutorial is complete or near-complete
+# Counting topologies
 
+**Yan Wong**
 
-:::{todo}
-Add examples of using {meth}`TreeSequence.count_topologies` and introduce {ref}`sec_combinatorics`
-in a gentle way.
+This tutorial is intended to be a gentle introduction to the combinatorial
+treatment of tree topologies in `tskit`. For a more formal introduction,
+see the {ref}`sec_combinatorics` section of the
+[official `tskit` documentation](tskit:sec_introduction).
 
-See https://github.com/tskit-dev/tutorials/issues/93
-:::
+The *topology* of a single tree is the term used to describe the branching pattern,
+regardless of the lengths of the branches. For example, both trees below have the
+same topology, although the branch lengths differ:
+
+```{code-cell}
+import tskit
+node_labels = {0: "a", 1: "b", 2: "c"}  # avoid confusion by using letters to label tips
+tree = tskit.Tree.generate_comb(3)
+display(tree.draw_svg(node_labels=node_labels, y_axis=True))
+
+deep_tree = tskit.Tree.generate_comb(10).tree_sequence.simplify([0, 1, 2]).first()
+display(deep_tree.draw_svg(node_labels=node_labels, y_axis=True))
+```
+
+:::{note}
+The treatment of topologies in `tskit` is restricted to trees with a single defined root,
+without nodes with a single child (i.e. trees must consist of nodes that are either leaves,
+or internal nodes with two or more children).  For convenience in the examples
+below, trees are drawn with the tips flagged as samples, although whether a node is a sample or
+not does not change the topology of the tree.
+:::
+
+## Tree labellings and shapes
+
+The topology of a tree also takes into account the labelling of tips, so that
+the trees below, although they have the same *shape*, count as three
+different topologies:
+
+```{code-cell}
+:tags: [hide-input]
+from string import ascii_lowercase
+from IPython.display import SVG
+
+def str_none(s, prefix=None):
+    if s is not None:
+        if prefix is None:
+            return str(s)
+        else:
+            return prefix + " = " + str(s)
+    return None
+
+def draw_svg_trees(trees, node_labels={}, x_lab_attr=None, width=100, height=150, space=10):
+    w = width + space
+    h = height + space
+    trees = list(trees)
+    s = f'<svg height="{h}" width="{w * len(trees)}" xmlns="http://www.w3.org/2000/svg">'
+    s += f'<style>.x-axis {{transform: translateY({space}px)}}</style>'
+    for i, tree in enumerate(trees):
+        s += tree.draw_svg(
+            size=(width, height),
+            canvas_size=(w, h),
+            root_svg_attributes={"x": i * w},
+            node_labels=node_labels,
+            x_label=str_none(getattr(tree.rank(), x_lab_attr or "", None), x_lab_attr)
+        )
+    s += '</svg>'
+    return SVG(s)
+
+draw_svg_trees(tskit.all_tree_labellings(tree), node_labels={u: ascii_lowercase[u] for u in tree.samples()})
+```
+
+These are, in fact, the only possible three labellings for a three-tip tree of that shape.
+There is only one other possible shape for a three-tip tree, and for this shape,
+all labelling orders are equivalent (in other words, there is only one
+possible labelling):
+
+```{code-cell}
+:tags: [hide-input]
+tskit.Tree.generate_star(3).draw_svg(node_labels={})
+```
+
+A 3-tip tree therefore has only four possible topologies.
+These can be generated with the {func}`~tskit.all_trees` function.
+
+```{code-cell}
+generated_trees = tskit.all_trees(3)
+print("For a three-tip tree there are", len(list(generated_trees)), "labelled topologies.")
+```
+
+Here they are, plotted out with their shapes enumerated from zero:
+
+```{code-cell}
+:tags: [hide-input]
+draw_svg_trees(
+    tskit.all_trees(3),
+    node_labels={u: ascii_lowercase[u] for u in tree.samples()},
+    x_lab_attr="shape"
+)
+```
+
+### Enumerating shapes and labellings
+
+For a tree with four tips, more topologies and shapes are possible. As before, we can generate the
+topologies using {func}`~tskit.all_trees`. Alternatively, if we only want the (unlabelled) shapes,
+we can use the {func}`~tskit.all_tree_shapes` function:
+
+```{code-cell}
+print("For a four-tip tree there are", len(list(tskit.all_trees(4))), "labelled topologies.")
+
+generated_trees = tskit.all_tree_shapes(4)
+print("These can be categorised into", len(list(generated_trees)), "shapes.")
+```
+
+Again, we can give each shape a number or *index*, starting from zero:
+
+```{code-cell}
+:tags: [hide-input]
+draw_svg_trees(tskit.all_tree_shapes(4), x_lab_attr="shape")
+```
+
+Each of these shapes will have a separate number of possible labellings, and trees with
+these labellings can be created using {func}`~tskit.all_tree_labellings`:
+
+```{code-cell}
+for shape_index, tree in enumerate(tskit.all_tree_shapes(4)):
+    labellings = tskit.all_tree_labellings(tree)
+    num_labellings = len(list(labellings))
+    print(
+        f"Tree shape {shape_index} for a four-tip tree has "
+        f"{num_labellings} labelling{'' if num_labellings==1 else 's'}."
+    )
+```
+
+Any tree topology for a tree of $N$ tips can therefore be described by a
+shape index combined with a labelling index. This is known as the
+*rank* of a tree, and it can be obtained using the
+{meth}`Tree.rank` method. For instance, here is the rank of a simulated tree
+of 10 tips:
+
+```{code-cell}
+:tags: [hide-input]
+import msprime
+num_tips = 10
+simulated_ts = msprime.sim_ancestry(10, ploidy=1, random_seed=123)
+simulated_tree = simulated_ts.first()
+print("The topology of the simulated tree below can be described as", simulated_tree.rank())
+ascii_node_labels = {u: ascii_lowercase[u] for u in simulated_tree.samples()}
+simulated_tree.draw_svg(node_labels=ascii_node_labels)
+```
+
+
+A tree with the same topology (i.e. the same shape and labelling, but ignoring
+the branch lengths) can be generated using the {meth}`Tree.unrank` method, by
+specifying the number of tips and the appropriate `(shape, labelling)` tuple:
+
+```{code-cell}
+new_tree = tskit.Tree.unrank(num_tips, (1270, 21580))
+new_tree.draw_svg(node_labels=ascii_node_labels)
+```
+
+Note that this method generates a single tree in a new tree sequence
+whose a default sequence length is 1.0.
+
+## Methods for large trees
+
+The number of possible topologies for a tree with $N$ tips
+grows very rapidly with $N$. For instance, with 10 tips, there are
+282,137,824 possible topologies.
+
+For this reason, the {func}`~tskit.all_trees`, {func}`~tskit.all_tree_shapes` and
+{func}`~tskit.all_tree_labellings` methods do not return a list of trees
+but an iterator over the trees. This means it is perfectly possible to start
+iterating over (say) all tree shapes for a tree of 100 leaves, but
+the iterator will not finish before the death of our galaxy.
+
+```{code-cell}
+for num_trees, tree in enumerate(tskit.all_tree_shapes(100)):
+    shape = tree.rank().shape
+    b2 = tree.b2_index()
+    print(f"A 100-tip tree with shape index {shape} has a b2 balance index of {b2}")
+    if num_trees > 5:
+      break  # better not let this run too long!
+```
+
+For similar combinatorial reasons, the {meth}`Tree.rank` method can be
+inefficient for large trees. To compare the topology of two trees, you are
+therefore recommended to use e.g. the {meth}`Tree.kc_distance` method
+rather than comparing ranks directly.
+
+```{code-cell}
+simulated_tree = simulated_ts.first(sample_lists=True)  # kc_distance requires sample lists
+if simulated_ts.first(sample_lists=True).kc_distance(simulated_tree) == 0:
+    print("Trees are identical")
+    # To compare to the new_tree we need to fix 
+    # print("The simulated and topology-constructed trees have the same topology")
+```
+
+Despite the combinatorial explosion associated with topologies of
+many-tip trees, it is still possible to efficiently count
+the number of *embedded topologies* in a large tree.
+
+### Embedded topologies
+
+An embedded topology is a a topology involving a subset of the tips of a tree.
+If the tips are classified into (say) three groups, red, green, and blue,
+we can efficiently count all the embedded three-tip trees which have
+one tip from each group using the {meth}`Tree.count_topologies` method.
+
+```{code-cell}
+:tags: [hide-input]
+
+newick_species_tree = "((A:100.0,B:100.0):100.0,C:200.0)"
+demography = msprime.Demography.from_species_tree(newick_species_tree, initial_size=100)
+ts = msprime.sim_ancestry({0: 2, 1: 2, 2: 2}, demography=demography, random_seed=321)
+big_tree = ts.first()
+styles = [
+  f".node.sample.p{p.id} > .sym " + "{" + f"fill: {colour}" + "}"
+  for colour, p in zip(['red', 'green', 'blue'], big_tree.tree_sequence.populations())
+]
+big_tree.draw_svg(style="".join(styles), node_labels={}, time_scale="rank", x_label="big_tree")
+```
+
+In this tree, it is fairly clear that the red and green tips cluster together (although not exclusively),
+so that if we took one red, one blue, and one green tip at random, we nearly always see the
+same embedded topology. The {meth}`Tree.count_topologies` method does this exhaustively:
+
+```{code-cell}
+# By default `count_topologies` chooses one tip from each population, like setting
+# sample_sets=[ts.samples(p.id) for p in ts.populations() if len(ts.samples(p.id)) > 0]
+
+topology_counter = big_tree.count_topologies()
+
+colours = ['red', 'green', 'blue']
+styles = [f".n{u}>.sym {{fill: {c} }}" for u, c in enumerate(colours)]
+
+embedded_counts = topology_counter[0, 1, 2]
+for embedded_tree in tskit.all_trees(3):
+    rank = embedded_tree.rank()
+    number_of_instances = embedded_counts[rank]
+    label = f"{number_of_instances} instances embedded in big_tree"
+    display(embedded_tree.draw_svg(style="".join(styles), node_labels={}, x_label=label))
+```
+
+## Methods over tree sequences
+
+It can be useful to count embedded topologies over an entire tree sequence.
+For instance, we might want to know the number of embedded topologies
+that support Neanderthals as a sister group to europeans versus africans.
+`Tskit` provides the efficient {meth}`TreeSequence.count_topologies` method to
+do this [incrementally](sec_incremental), without having to re-count the topologies
+independently in each tree.
+
+```{code-cell}
+import stdpopsim
+species = stdpopsim.get_species("HomSap")
+model = species.get_demographic_model("OutOfAfrica_3G09")
+contig = species.get_contig("chr1", length_multiplier=0.0002, mutation_rate=model.mutation_rate)
+samples = {"YRI": 1000, "CEU": 1000, "CHB": 1000}
+engine = stdpopsim.get_engine("msprime")
+ts = engine.simulate(model, contig, samples)
+print("Simulated", ts.num_trees, "African+European+Chinese trees, each with", ts.num_samples, "tips")
+```
+
+Although the trees in this tree sequence are very large, counting the embedded topologies is
+quite doable (for speed we are only simulating 0.02% of chromosome 1, but calculating the
+average over an entire chromosome simply takes a little longer)
+
+```{code-cell}
+from datetime import datetime
+names = {"YRI": "African", "CEU": "European", "CHB": "Chinese"}
+colours = {"YRI": "yellow", "CEU": "green", "CHB": "blue"}
+
+population_map = {p.metadata["id"]: p.id for p in ts.populations()}
+sample_populations = [population_map[name] for name in samples.keys()]
+topology_span = {tree.rank(): 0 for tree in tskit.all_trees(len(sample_populations))}
+
+start = datetime.now()
+total = 0
+for topology_counter, tree in zip(ts.count_topologies(), ts.trees()):
+    embedded_topologies = topology_counter[sample_populations]
+    weight = tree.span / ts.sequence_length
+    for rank, count in embedded_topologies.items():
+        topology_span[rank] += count * weight
+        total += count
+print(f"Counted {total} embedded topologies in {datetime.now() - start} seconds")
+```
+
+```{code-cell}
+:tags: [hide-input]
+ntips = len(sample_populations)
+styles = ".sample text.lab {baseline-shift: super; font-size: 0.7em;}"
+node_labels = {}
+
+for p in range(ntips):
+    name = ts.population(sample_populations[p]).metadata["id"]
+    node_labels[p] = names[name]
+    styles += f".n{p}>.sym {{fill: {colours[name]} }}"
+
+total = sum(topology_span.values())
+for rank, weight in topology_span.items():
+    label = f"{weight/total *100:.1f}% of genome"
+    embedded_tree = tskit.Tree.unrank(ntips, rank)
+    display(embedded_tree.draw_svg(size=(160, 150), style="".join(styles), node_labels=node_labels, x_label=label))
+```
+
+For an example with real data, see {ref}`sec_popgen_topological`
+in the {ref}`sec_intro_popgen` tutorial.
diff --git a/incremental_algorithms.md b/incremental_algorithms.md
@@ -0,0 +1,32 @@
+---
+jupytext:
+  text_representation:
+    extension: .md
+    format_name: myst
+    format_version: 0.12
+    jupytext_version: 1.9.1
+kernelspec:
+  display_name: Python 3
+  language: python
+  name: python3
+---
+
+```{currentmodule} tskit
+```
+
+(sec_incremental)=
+# _Incremental algorithms_
+
+Much of the [efficiency](sec_what_is_analysis)
+of the tskit approach comes from the use of incremental algorithms.
+By considering only the difference between adjacent trees,
+incremental algorithms avoid having to perform the same
+calculation multiple times on different trees.
+
+This tutorial will explain the philosophy behind incremental algorithms,
+and provide examples of how to create your own (e.g. using the
+{meth}`TreeSequence.edge_diffs` method).
+
+:::{todo}
+Create content. See [issue 233](https://github.com/tskit-dev/tutorials/issues/233)
+:::
diff --git a/popgen.md b/popgen.md
@@ -316,8 +316,11 @@ could also have been used to produce the plot above. In this case, the advantage
 using tree sequences is simply that they allow these sorts of analysis to
 {ref}`scale<plot_incremental_calculation>` to datasets of millions of whole genomes.
 
+(sec_popgen_topological)=
 
-However, because this inferred tree sequence stores (an estimate of) the underlying
+### Topological analysis
+
+As this inferred tree sequence stores (an estimate of) the underlying
 genealogy, we can also derive statistics based on genealogical relationships. For
 example, this tree sequence also contains a sample genome based on an ancient
 genome, a [Denisovan](https://en.wikipedia.org/wiki/Denisovan) individual. We can
@@ -329,6 +332,9 @@ Show an example of looking at topological relationships between the Denisovan an
 various East Asian groups, using the {ref}`sec_counting_topologies` functionality.
 :::
 
+See {ref}`sec_counting_topologies` for an introduction to topological methods in
+`tskit`.
+
 ## Further information
 
 This brief introduction is meant as a simple taster. Many other efficient population