@@ -16,13 +16,309 @@ kernelspec:
1616
1717(sec_counting_topologies)=
1818
19- # _ Counting topologies_
20- % remove underscores in title when tutorial is complete or near-complete
19+ # Counting topologies
2120
21+ ** Yan Wong**
2222
23- :::{todo}
24- Add examples of using {meth}` TreeSequence.count_topologies ` and introduce {ref}` sec_combinatorics `
25- in a gentle way.
23+ This tutorial is intended to be a gentle introduction to the combinatorial
24+ treatment of tree topologies in ` tskit ` . For a more formal introduction,
25+ see the {ref}` sec_combinatorics ` section of the
26+ [ official ` tskit ` documentation] ( tskit:sec_introduction ) .
2627
27- See https://github.com/tskit-dev/tutorials/issues/93
28- :::
28+ The * topology* of a single tree is the term used to describe the branching pattern,
29+ regardless of the lengths of the branches. For example, both trees below have the
30+ same topology, although the branch lengths differ:
31+
32+ ``` {code-cell}
33+ import tskit
34+ node_labels = {0: "a", 1: "b", 2: "c"} # avoid confusion by using letters to label tips
35+ tree = tskit.Tree.generate_comb(3)
36+ display(tree.draw_svg(node_labels=node_labels, y_axis=True))
37+
38+ deep_tree = tskit.Tree.generate_comb(10).tree_sequence.simplify([0, 1, 2]).first()
39+ display(deep_tree.draw_svg(node_labels=node_labels, y_axis=True))
40+ ```
41+
42+ :::{note}
43+ The treatment of topologies in ` tskit ` is restricted to trees with a single defined root,
44+ without nodes with a single child (i.e. trees must consist of nodes that are either leaves,
45+ or internal nodes with two or more children). For convenience in the examples
46+ below, trees are drawn with the tips flagged as samples, although whether a node is a sample or
47+ not does not change the topology of the tree.
48+ :::
49+
50+ ## Tree labellings and shapes
51+
52+ The topology of a tree also takes into account the labelling of tips, so that
53+ the trees below, although they have the same * shape* , count as three
54+ different topologies:
55+
56+ ``` {code-cell}
57+ :tags: [hide-input]
58+ from string import ascii_lowercase
59+ from IPython.display import SVG
60+
61+ def str_none(s, prefix=None):
62+ if s is not None:
63+ if prefix is None:
64+ return str(s)
65+ else:
66+ return prefix + " = " + str(s)
67+ return None
68+
69+ def draw_svg_trees(trees, node_labels={}, x_lab_attr=None, width=100, height=150, space=10):
70+ w = width + space
71+ h = height + space
72+ trees = list(trees)
73+ s = f'<svg height="{h}" width="{w * len(trees)}" xmlns="http://www.w3.org/2000/svg">'
74+ s += f'<style>.x-axis {{transform: translateY({space}px)}}</style>'
75+ for i, tree in enumerate(trees):
76+ s += tree.draw_svg(
77+ size=(width, height),
78+ canvas_size=(w, h),
79+ root_svg_attributes={"x": i * w},
80+ node_labels=node_labels,
81+ x_label=str_none(getattr(tree.rank(), x_lab_attr or "", None), x_lab_attr)
82+ )
83+ s += '</svg>'
84+ return SVG(s)
85+
86+ draw_svg_trees(tskit.all_tree_labellings(tree), node_labels={u: ascii_lowercase[u] for u in tree.samples()})
87+ ```
88+
89+ These are, in fact, the only possible three labellings for a three-tip tree of that shape.
90+ There is only one other possible shape for a three-tip tree, and for this shape,
91+ all labelling orders are equivalent (in other words, there is only one
92+ possible labelling):
93+
94+ ``` {code-cell}
95+ :tags: [hide-input]
96+ tskit.Tree.generate_star(3).draw_svg(node_labels={})
97+ ```
98+
99+ A 3-tip tree therefore has only four possible topologies.
100+ These can be generated with the {func}` ~tskit.all_trees ` function.
101+
102+ ``` {code-cell}
103+ generated_trees = tskit.all_trees(3)
104+ print("For a three-tip tree there are", len(list(generated_trees)), "labelled topologies.")
105+ ```
106+
107+ Here they are, plotted out with their shapes enumerated from zero:
108+
109+ ``` {code-cell}
110+ :tags: [hide-input]
111+ draw_svg_trees(
112+ tskit.all_trees(3),
113+ node_labels={u: ascii_lowercase[u] for u in tree.samples()},
114+ x_lab_attr="shape"
115+ )
116+ ```
117+
118+ ### Enumerating shapes and labellings
119+
120+ For a tree with four tips, more topologies and shapes are possible. As before, we can generate the
121+ topologies using {func}` ~tskit.all_trees ` . Alternatively, if we only want the (unlabelled) shapes,
122+ we can use the {func}` ~tskit.all_tree_shapes ` function:
123+
124+ ``` {code-cell}
125+ print("For a four-tip tree there are", len(list(tskit.all_trees(4))), "labelled topologies.")
126+
127+ generated_trees = tskit.all_tree_shapes(4)
128+ print("These can be categorised into", len(list(generated_trees)), "shapes.")
129+ ```
130+
131+ Again, we can give each shape a number or * index* , starting from zero:
132+
133+ ``` {code-cell}
134+ :tags: [hide-input]
135+ draw_svg_trees(tskit.all_tree_shapes(4), x_lab_attr="shape")
136+ ```
137+
138+ Each of these shapes will have a separate number of possible labellings, and trees with
139+ these labellings can be created using {func}` ~tskit.all_tree_labellings ` :
140+
141+ ``` {code-cell}
142+ for shape_index, tree in enumerate(tskit.all_tree_shapes(4)):
143+ labellings = tskit.all_tree_labellings(tree)
144+ num_labellings = len(list(labellings))
145+ print(
146+ f"Tree shape {shape_index} for a four-tip tree has "
147+ f"{num_labellings} labelling{'' if num_labellings==1 else 's'}."
148+ )
149+ ```
150+
151+ Any tree topology for a tree of $N$ tips can therefore be described by a
152+ shape index combined with a labelling index. This is known as the
153+ * rank* of a tree, and it can be obtained using the
154+ {meth}` Tree.rank ` method. For instance, here is the rank of a simulated tree
155+ of 10 tips:
156+
157+ ``` {code-cell}
158+ :tags: [hide-input]
159+ import msprime
160+ num_tips = 10
161+ simulated_ts = msprime.sim_ancestry(10, ploidy=1, random_seed=123)
162+ simulated_tree = simulated_ts.first()
163+ print("The topology of the simulated tree below can be described as", simulated_tree.rank())
164+ ascii_node_labels = {u: ascii_lowercase[u] for u in simulated_tree.samples()}
165+ simulated_tree.draw_svg(node_labels=ascii_node_labels)
166+ ```
167+
168+
169+ A tree with the same topology (i.e. the same shape and labelling, but ignoring
170+ the branch lengths) can be generated using the {meth}` Tree.unrank ` method, by
171+ specifying the number of tips and the appropriate ` (shape, labelling) ` tuple:
172+
173+ ``` {code-cell}
174+ new_tree = tskit.Tree.unrank(num_tips, (1270, 21580))
175+ new_tree.draw_svg(node_labels=ascii_node_labels)
176+ ```
177+
178+ Note that this method generates a single tree in a new tree sequence
179+ whose a default sequence length is 1.0.
180+
181+ ## Methods for large trees
182+
183+ The number of possible topologies for a tree with $N$ tips
184+ grows very rapidly with $N$. For instance, with 10 tips, there are
185+ 282,137,824 possible topologies.
186+
187+ For this reason, the {func}` ~tskit.all_trees ` , {func}` ~tskit.all_tree_shapes ` and
188+ {func}` ~tskit.all_tree_labellings ` methods do not return a list of trees
189+ but an iterator over the trees. This means it is perfectly possible to start
190+ iterating over (say) all tree shapes for a tree of 100 leaves, but
191+ the iterator will not finish before the death of our galaxy.
192+
193+ ``` {code-cell}
194+ for num_trees, tree in enumerate(tskit.all_tree_shapes(100)):
195+ shape = tree.rank().shape
196+ b2 = tree.b2_index()
197+ print(f"A 100-tip tree with shape index {shape} has a b2 balance index of {b2}")
198+ if num_trees > 5:
199+ break # better not let this run too long!
200+ ```
201+
202+ For similar combinatorial reasons, the {meth}` Tree.rank ` method can be
203+ inefficient for large trees. To compare the topology of two trees, you are
204+ therefore recommended to use e.g. the {meth}` Tree.kc_distance ` method
205+ rather than comparing ranks directly.
206+
207+ ``` {code-cell}
208+ simulated_tree = simulated_ts.first(sample_lists=True) # kc_distance requires sample lists
209+ if simulated_ts.first(sample_lists=True).kc_distance(simulated_tree) == 0:
210+ print("Trees are identical")
211+ # To compare to the new_tree we need to fix
212+ # print("The simulated and topology-constructed trees have the same topology")
213+ ```
214+
215+ Despite the combinatorial explosion associated with topologies of
216+ many-tip trees, it is still possible to efficiently count
217+ the number of * embedded topologies* in a large tree.
218+
219+ ### Embedded topologies
220+
221+ An embedded topology is a a topology involving a subset of the tips of a tree.
222+ If the tips are classified into (say) three groups, red, green, and blue,
223+ we can efficiently count all the embedded three-tip trees which have
224+ one tip from each group using the {meth}` Tree.count_topologies ` method.
225+
226+ ``` {code-cell}
227+ :tags: [hide-input]
228+
229+ newick_species_tree = "((A:100.0,B:100.0):100.0,C:200.0)"
230+ demography = msprime.Demography.from_species_tree(newick_species_tree, initial_size=100)
231+ ts = msprime.sim_ancestry({0: 2, 1: 2, 2: 2}, demography=demography, random_seed=321)
232+ big_tree = ts.first()
233+ styles = [
234+ f".node.sample.p{p.id} > .sym " + "{" + f"fill: {colour}" + "}"
235+ for colour, p in zip(['red', 'green', 'blue'], big_tree.tree_sequence.populations())
236+ ]
237+ big_tree.draw_svg(style="".join(styles), node_labels={}, time_scale="rank", x_label="big_tree")
238+ ```
239+
240+ In this tree, it is fairly clear that the red and green tips cluster together (although not exclusively),
241+ so that if we took one red, one blue, and one green tip at random, we nearly always see the
242+ same embedded topology. The {meth}` Tree.count_topologies ` method does this exhaustively:
243+
244+ ``` {code-cell}
245+ # By default `count_topologies` chooses one tip from each population, like setting
246+ # sample_sets=[ts.samples(p.id) for p in ts.populations() if len(ts.samples(p.id)) > 0]
247+
248+ topology_counter = big_tree.count_topologies()
249+
250+ colours = ['red', 'green', 'blue']
251+ styles = [f".n{u}>.sym {{fill: {c} }}" for u, c in enumerate(colours)]
252+
253+ embedded_counts = topology_counter[0, 1, 2]
254+ for embedded_tree in tskit.all_trees(3):
255+ rank = embedded_tree.rank()
256+ number_of_instances = embedded_counts[rank]
257+ label = f"{number_of_instances} instances embedded in big_tree"
258+ display(embedded_tree.draw_svg(style="".join(styles), node_labels={}, x_label=label))
259+ ```
260+
261+ ## Methods over tree sequences
262+
263+ It can be useful to count embedded topologies over an entire tree sequence.
264+ For instance, we might want to know the number of embedded topologies
265+ that support Neanderthals as a sister group to europeans versus africans.
266+ ` Tskit ` provides the efficient {meth}` TreeSequence.count_topologies ` method to
267+ do this [ incrementally] ( sec_incremental ) , without having to re-count the topologies
268+ independently in each tree.
269+
270+ ``` {code-cell}
271+ import stdpopsim
272+ species = stdpopsim.get_species("HomSap")
273+ model = species.get_demographic_model("OutOfAfrica_3G09")
274+ contig = species.get_contig("chr1", length_multiplier=0.0002, mutation_rate=model.mutation_rate)
275+ samples = {"YRI": 1000, "CEU": 1000, "CHB": 1000}
276+ engine = stdpopsim.get_engine("msprime")
277+ ts = engine.simulate(model, contig, samples)
278+ print("Simulated", ts.num_trees, "African+European+Chinese trees, each with", ts.num_samples, "tips")
279+ ```
280+
281+ Although the trees in this tree sequence are very large, counting the embedded topologies is
282+ quite doable (for speed we are only simulating 0.02% of chromosome 1, but calculating the
283+ average over an entire chromosome simply takes a little longer)
284+
285+ ``` {code-cell}
286+ from datetime import datetime
287+ names = {"YRI": "African", "CEU": "European", "CHB": "Chinese"}
288+ colours = {"YRI": "yellow", "CEU": "green", "CHB": "blue"}
289+
290+ population_map = {p.metadata["id"]: p.id for p in ts.populations()}
291+ sample_populations = [population_map[name] for name in samples.keys()]
292+ topology_span = {tree.rank(): 0 for tree in tskit.all_trees(len(sample_populations))}
293+
294+ start = datetime.now()
295+ total = 0
296+ for topology_counter, tree in zip(ts.count_topologies(), ts.trees()):
297+ embedded_topologies = topology_counter[sample_populations]
298+ weight = tree.span / ts.sequence_length
299+ for rank, count in embedded_topologies.items():
300+ topology_span[rank] += count * weight
301+ total += count
302+ print(f"Counted {total} embedded topologies in {datetime.now() - start} seconds")
303+ ```
304+
305+ ``` {code-cell}
306+ :tags: [hide-input]
307+ ntips = len(sample_populations)
308+ styles = ".sample text.lab {baseline-shift: super; font-size: 0.7em;}"
309+ node_labels = {}
310+
311+ for p in range(ntips):
312+ name = ts.population(sample_populations[p]).metadata["id"]
313+ node_labels[p] = names[name]
314+ styles += f".n{p}>.sym {{fill: {colours[name]} }}"
315+
316+ total = sum(topology_span.values())
317+ for rank, weight in topology_span.items():
318+ label = f"{weight/total *100:.1f}% of genome"
319+ embedded_tree = tskit.Tree.unrank(ntips, rank)
320+ display(embedded_tree.draw_svg(size=(160, 150), style="".join(styles), node_labels=node_labels, x_label=label))
321+ ```
322+
323+ For an example with real data, see {ref}` sec_popgen_topological `
324+ in the {ref}` sec_intro_popgen ` tutorial.
0 commit comments