1 Introduction
A wide range of tasks in computational biology start by determining, for a given “query gene” in a species , one or all genes in another species that are most similar to . Conceptually, these best matches of the query genes are meant to approximate the set of genes in that are evolutionary most closely related to . Best matches can be identified by comparing evolutionary distances (Nei and Zhang, 2006), which in turn are usually obtained from sequence alignments (Chatzou et al., 2016). In practice, fast approximation algorithms such as blast and its more modern successors are often used for this purpose (MorenoHagelsieb and Latimer, 2008, HernándezSalmerón and MorenoHagelsieb, 2020). Even if sequence similarity is measured perfectly, deviations from a common molecular clock, i.e., differences in the evolutionary rates of different genes, cause discrepancies between best hits (most similar sequences) and best matches (evolutionary most closely related sequences), see (Stadler et al., 2020) for a detailed discussion.
The idea of best matches in the sense of closest evolutionary relatedness presupposes an underlying tree that describes the phylogenetic relationships among the genes, which correspond to the leaves of , and a map assigning to each gene the species in which it resides. Given such a leafcolored tree , the best match graph has as its vertex set the leaves of , i.e., the set of genes, and as (directed) arcs the best matches. The latter are defined as the pairs for which the last common ancestor of and is at least as close to as the last common ancestor of and any other gene from the same spaces . Best match graphs (BMGs), i.e., graphs that are derived from a leaflabeled tree in this manner, form a very restrictive class of colored digraphs (Geiß et al., 2019, Schaller et al., 2021a). Empirically determined best hit data therefore will in general not satisfy the defining properties of BMGs. They can be corrected in part, however, by considering the problem of editing a given graph to the closest BMG. Schaller et al. (2021d)
showed that the arc modification problems for BMGs are NPhard, but can be formulated as integer linear programs (ILPs) allowing practical solutions for small instances. However, in computational biology, applications to large gene families would be of particular interest, creating the need for faster, approximate solutions for BMG editing. Before embarking to develop software for a BMGbased analysis of large sequence data sets, we need to understand whether the editing problem for BMGs is tractable in practice with sufficient accuracy and for interestingly large instances. The purpose of this contributions is to establish that this is indeed the case.
Motivated by both theoretical and practical considerations, we are mainly interested in heuristics that are consistent in the following sense: Let be an algorithm that takes an arbitrary vertexcolored digraph as input and outputs a BMG . Then is consistent if whenever the input graph is a BMG.
A vertexcolored digraph is a BMG if and only if (a) a certain set of informative triples , which can easily be read off the input graph, is consistent and (b) the BMG of the corresponding socalled Aho tree coincides with (Schaller et al., 2021a). In general, the Aho tree of a compatible set of triples on a set is a least resolved supertree of all the triples in . For a BMG, is the unique least resolved tree (LRT) for . These close connections between recognizing BMGs and constructing supertrees suggest to adapt ideas from heuristic algorithms for triple consistency problems and supertree construction for BMG editing.
The simplest approach, therefore, is to extract a maximal consistent subset from and to use as an approximation, see Sec. 4. A more detailed analysis of arcs in that violate the property of being a BMG in Sec. 5, however, leads a notion of “unsatisfiable relations” (UR), which can be used to count the arc modifications associated with a partition of the vertex set of . It also gives raise to a topdown algorithm in which the vertex set of is recursively edited and partitioned. A large class of heuristics for BMG editing can be constructed depending on the construction of the partition in each recursion step. We shall see that the arc edit sets in different steps of the recursion are disjoint. A main result of this contribution, Thm. 3, links the partitions appearing in BMG editing algorithms to the auxiliary graphs appearing in the BUILD algorithm for supertree construction (Aho et al., 1981), see Sec. 2. This provides a guarantee that the BMG editing algorithms are consistent provided the choice of is such that it does not enforce edits whenever an alternative partition with and empty UR is available. For BMGs, this is in particular the case for the partitions appearing in the BUILD algorithm. In Sec. 7, we proceed to show by reduction from Set Splitting that finding a partition with a minimal number of unsatisfiable relations is NPhard.
The theoretical results are complemented by computational experiments on BMGs with randomly perturbed arc sets in Sec. 8. We focus on a comparison of different algorithms to construct the partitions . Somewhat surprisingly, we find that minimizing the cardinality of the UR alone is not the best approach, since this tends to produce very unbalanced partitions and thus requires a large number of steps in the recursions whose costs add up. Instead, certain types of clustering or community detection approaches that favor more balanced partitions tend to perform well.
2 Notation and Preliminaries
Partitions.
is a partition of a set if (i) , (ii) and (iii) for . A partition is nontrivial if . Consider two partitions and of . If for every there is a such that , i.e., if every set in is completely contained in a set in , then is a refinement of , and is a coarsegraining of .
Graphs.
Mostly, we consider simple directed graphs (digraphs) with vertex set and arc set . We will frequently write and to explicitly refer to the graph . For a vertex , we say that is an inarc and is an outarc. The subgraph induced by a subset is denoted by . Undirected graphs can be identified with symmetric digraphs, i.e., the undirected graph underlying a digraph is obtained by dropping the direction of all arcs, or by symmetrizing the digraph, i.e., adding the arc to for every arc . When referring to an undirected graph , we write for and call an edge. The (weakly) connected components of are the maximal connected subgraphs of the undirected graph underlying or, equivalently, the maximal strongly connected components of the symmetrized digraph. Whenever the context is clear, we will also refer to the partition of formed by the vertex sets of the maximal connected subgraphs as the set of connected components.
A vertex coloring is a map , where is a nonempty set of colors. A vertex coloring of is proper if whenever . We write for a vertexcolored digraph and denote by the subset of vertices of a graph that have color . Moreover, we define for the subset of colors present in a set .
We write for the set of outneighbors of a vertex and for the set of inneighbors of . A graph is called sinkfree if holds for all . We write for the symmetric difference of two sets and , and define, for a graph and arc set , the graph . Analogously, we write and .
Phylogenetic trees.
Consider an undirected, rooted tree with leaf set and root . Its inner vertices are given by the set . The ancestor order on is defined such that if lies on the unique path from to the root , i.e., if is an ancestor of . For brevity we set if and . If is an edge in such that , then is the parent of and the child of . The set of children of a vertex is denoted by . A tree is phylogenetic if all its inner vertices have at least two children. All trees considered in this contribution will be phylogenetic. For a nonempty subset , we define , the last common ancestor of , to be the unique minimal vertex of that is an ancestor of every . Following e.g. Bryant and Steel (1995), we denote by the restriction of to a subset , i.e. is obtained by identifying the (unique) minimal subtree of that connects all leaves in , and suppressing all vertices with degree two except possibly the root . We say that displays or is a refinement of a tree , in symbols , if can be obtained from a restriction of by a series of inner edge contractions. is a leafcolored tree if is a map from the leaves of into a nonempty set of colors. We say that is displayed by if and for all .
Rooted triples.
A (rooted) triple is a tree on three leaves and with two inner vertices. We write for the triple on the leaves and if the path from to does not intersect the path from to the root in , i.e., if . In this case we say that displays . We write for the restriction of a triple set to a set of leaves. A set of triples is consistent if there is a tree with leaf set that displays every triple in . The polynomialtime algorithm BUILD decides for every triple set whether it is consistent, and if so, constructs a particular tree, the Aho tree , that displays every triple in (Aho et al., 1981). The algorithm relies on the construction of an (undirected) auxiliary graph, the Aho graph, for a given triple set on a set of leaves . This graph, denoted by , contains an edge if and only if for some .
3 Best match graphs
Definition 1.
Let be a leafcolored tree. A leaf is a best match of the leaf if and holds for all leaves of color .
Given , the graph with vertex set , vertexcoloring , and with arcs if and only if is a best match of w.r.t. is called the best match graph (BMG) of (Geiß et al., 2019).
Definition 2.
An arbitrary vertexcolored graph is a best match graph (BMG) if there exists a leafcolored tree such that . In this case, we say that explains .
We say that is an BMG if . By construction, there is at least one best match of for every color :
Observation 1.
For every vertex and every color in a BMG there is some vertex with . Equivalently, the subgraph induced by every pair of colors is sinkfree.
In particular, therefore, BMGs are sinkfree whenever they contain at least two colors. We formalize this basic property of BMGs for colored graphs in general:
Definition 3.
Let be a colored graph. The coloring is sinkfree if it is proper and, for every vertex and every color in , there is a vertex with . A graph with a sinkfree coloring is called sfcolored.
Given a tree and an edge , we denote by the tree obtained from by contracting the edge . We call an edge in redundant (w.r.t. ) if both and explain .
Definition 4.
A tree is least resolved for a BMG if (i) it explains and (ii) it does not contain any redundant edges w.r.t. .
Geiß et al. (2019, Thm. 8) showed that every BMG has a unique least resolved tree (LRT).
Geiß et al. (2019) gave a characterization of BMGs that makes use of a set of informative triples, which can be defined compactly as follows (Schaller et al., 2021c):
Definition 5.
Let be a vertexcolored digraph. Then the set of informative triples is
and the set of forbidden triples is  
For the subclass of BMGs that can be explained by binary trees, we will furthermore need  
By definition, must be pairwise distinct whenever , , or .
We extend the notion of consistency to pairs of triple sets in
Definition 6.
Let and be sets of triples. The pair is called consistent if there is a tree that displays all triples in but none of the triples in . In this case, we also say that agrees with .
It can be decided in polynomial time whether such a pair is consistent using the algorithm MTT, which was named after the corresponding mixed triplets problem restricted to trees and described by He et al. (2006).
We continue with two simple observations concerning the restriction of triple sets. Since informative and forbidden triples are only defined by the presence and absence of arcs in the subgraph of induced by , this leads to the following
Observation 2.
(Schaller et al., 2021b) Let be a vertexcolored digraph and . Then holds for every .
Moreover, any pair of triples such that and for a consistent pair remains consistent since any tree that agrees with clearly displays all triples in and none of the triples in . Hence, we have
Observation 3.
Let and for a consistent pair of triple sets . Then is consistent.
We summarize two characterizations of BMGs given in (Schaller et al., 2021a, Thm. 15) and (Schaller et al., 2021d, Lemma 3.4 and Thm. 3.5) in the following
Proposition 1.
Let be a properly colored digraph with vertex set . Then the following three statements are equivalent:

is a BMG.

is consistent and .

is sfcolored and is consistent.
In this case, is the unique LRT for , and a leafcolored tree on explains if and only if it agrees with .
Prop. 1 states that the set of informative triples of a BMG is consistent. Therefore, it can be used to construct its LRT by means of the BUILD algorithm, see Fig. 1 for an example.
It is important to note that both arc insertions and deletions may lead to creation and loss of informative triples. In particular, when starting from a BMG, both types of modifications have the potential to make the triple set inconsistent as the example in Fig. 2 shows. This is indeed often the case even for moderate disturbances of a BMG as we shall see in Sec. 8.
We expect that empirically estimated best match relations will typically contain errors that correspond to both arc insertions and deletions w.r.t. the unknown underlying “true” best match graph. This motivates the problem of editing a given vertexcolored digraph to a BMG:
Problem 1 (BMG Editing).
Input:  A properly colored digraph and an integer . 

Question:  Is there a subset such that and 
is an BMG? 
Natural variants are BMG Completion and BMG Deletion where is replaced by and , respectively, i.e., only addition or deletion of arcs is allowed. Both BMG Editing and its variants are NPcomplete (Schaller et al., 2021d).
The heuristic algorithms considered in this contribution can be thought of as maps on the set of finite vertexcolored digraphs such that is a BMG for every vertexcolored input graph . In particular, the following property of such algorithms is desirable:
Definition 7.
A (BMGediting) algorithm is consistent if whenever is a BMG.
4 A simple, triplebased heuristic
The triplebased characterization summarized by Prop. 1 suggests a simple heuristic for BMG editing that relies on replacing the consistency checks for triple sets by the extraction of maximal sets of consistent triples (see Alg. 1). Both MaxRTC, the problem of extracting from a given set of rooted triples a maximumsize consistent subset, and MinRTI, the problem of finding a minimumsize subset such that is consistent, are NPhard (Jansson, 2001). Furthermore, MaxRTC is APXhard and MinRTI is inapproximable (Byrka et al., 2010). However, because of their practical importance in phylogenetics, a large number of practically useful heuristics have been devised, see e.g. (Gasieniec et al., 1999, Wu, 2004, Tazehkand et al., 2013).
As a consequence of Prop. 1, Alg. 1 is consistent, i.e., if and only if the input graph is a BMG, if a consistent heuristic is employed to solve MaxRTG/MinRTI, i.e., if consistent triple sets remain unchanged by the method approximating MaxRTG/MinRTI.
The heuristic Alg. 1 is not always optimal, even if MaxRTC/MinRTI is solved optimally. Fig. 3 shows an unconnected 2colored graph on three vertices that is not a BMG and does not contain informative triples. The BMG produced by Alg. 1 introduces two arcs into . However, can also be edited to a BMG by inserting only one arc.
A simple improvement is to start by enforcing obvious arcs: If is the only vertex with color , then by definition there must be an arc for every vertex . The computation then starts from the sets of informative triples of the modified graph. We shall see below that these are the only arcs that can safely be added to without other additional knowledge or constraints (cf. Thm. 2 below).
5 Locally optimal splits
Finding an optimal BMG editing of a digraph is equivalent to finding a tree on that minimizes the cardinality of
(1) 
Clearly, implies that is a BMG. However, finding a tree that minimizes is intractable (unless ) since BMG Editing, Problem 1 above, is NPcomplete (Schaller et al., 2021d).
We may ask, nevertheless, if trees on contain information about arcs and nonarcs in that are “unambiguously false” in the sense that they are contained in every edit set that converts into a BMG. Denote by the set of all phylogenetic trees on . The set of these “unambiguously false” (non)arcs can then be expressed as
(2) 
Since there are in general exponentially many trees on and thus, the problem of determining seems to be quite challenging on a first glance. We shall see in Thm. 2, however, that can be computed efficiently. We start with a conceptually simpler construction.
Definition 8.
Let be a properly vertexcolored digraph and a partition of with . Moreover, let be the set of trees on that satisfy . The set of unsatisfiable relations (UR), denoted by , is defined as
(3) 
The associated URcost is .
The set of (phylogenetic) trees is nonempty since in Def. 8. Moreover, by construction, if and only if
Intriguingly, the set , and thus the URcost , can be computed in polynomial time without any explicit knowledge of the possible trees to determine the set . To this end, we define the three sets
Lemma 1.
Let be a properly vertexcolored digraph and let be a partition of with . Then
The proof of Lemma 1 relates the possible cases between and the tree set in a straightforward manner. Since it is rather lengthy it is relegated to Appendix. Fig. 4 gives examples for all three types of unsatisfiable relations, i.e., for , , and . In particular, we have since it is an arc in but contains another red vertex . Moreover, since it is not an arc in but does not contain another green vertex. Finally, we have since it is not an arc in but is the only blue vertex in . In the example, the graph is already a BMG which, however, is not true in general.
Corollary 1.
The set can be computed in quadratic time.
Proof.
We first compute all numbers of vertices in with a given color . This can be done in if we do not explicitly store the zeroentries. Now, , i.e. , can be checked in constant time, and thus, it can also be decided in constant time whether or not a pair is contained in or . Since, given , the condition is equivalent to , membership in
can also be decided in constant time. Checking all ordered pairs
thus requires a total effort of . ∎∎Our discussion so far suggests a recursive topdown approach, made precise in Alg. 2. In each step, one determines a “suitably chosen” partition and then recurses on the subgraphs of the edited graph . More details on such suitable partitions will be given in Thm. 3 below. The parts in the algorithm highlighted in color can be omitted. They are useful, however, if one is also interested in a tree that explains the editing result and to show that is indeed a BMG (see below). Alg. 2 is designed to accumulate the edit sets in each step, Line 2. In particular, the total edit cost and the scores are closely tied together, which follows from the following result:
Lemma 2.
All edit sets constructed in Alg. 2 are pairwise disjoint.
The proof of Lemma 2 and a technical result on which it relies can be found in the Appendix. As an immediate consequence of Lemma 2, we have
Corollary 2.
The edit cost of Alg. 2 is the sum of the URcosts in each recursion step.
It is important to note that the edits must be applied immediately in each step (cf. Line 2 in Alg. 2). In particular, Lemma 2 and Cor. 2 pertain to the partitioning of the edited graph , not to the original graph .
Theorem 1.
Every pair of edited graph and tree produced as output by Alg. 2 satisfies . In particular, is a BMG.
Proof.
By construction, the tree is phylogenetic and there is a onetoone correspondence between the vertices and the recursion steps, which operate on the sets . If (or, equivalently, is an inner vertex of ), we furthermore have for the partition of chosen in that recursion step. In the following, we denote by the graph during the editing process, and by the input graph, i.e., as in Alg. 2. For brevity, we write for the arc set of the final edited graph and .
Let us assume, for contradiction, that there exists (a) , or (b) . In either case, we set and consider the recursion step on with the corresponding partition chosen for . Note that , and thus . Moreover, let be the child of such that , and .
Case (a): .
Since
and by the definition of best matches, there must be a
vertex of color such that
, and thus
. Moreover, we have ,
and . Two subcases need to be
considered, depending on whether or not is an arc in at
the beginning of the recursion step. In the first case, the
arguments above imply that ,
and thus, by
Lemma 1. Hence, we delete the arc in
this step. In the second case, it is an easy task to verify that
none of the definitions of ,
, and matches
for . Since this step is clearly the last one in the
recursion hierarchy that can affect the (non)arc , it follows
for both subcases that ; a contradiction.
Case (b): .
Since
and by the definition of best matches, there cannot be a
vertex of color such that
, and thus
. Moreover, we have ,
and . Again, two subcases need to
be distinguished depending on whether or not is an arc in
at the beginning of the recursion step. In the first case,
the arguments above make it easy to verify that none of the
definitions of ,
, and matches
for . In the second case, we obtain
, and thus,
by
Lemma 1. Hence, we insert the arc in
this step. As before, the (non)arc remains unaffected in any
deeper recursion step. Therefore, we have in both
subcases; a contradiction.
Finally, immediately implies that is a BMG. ∎∎
Cor. 2 suggests a greedylike “local” approach. In each step, the partition is chosen to minimize the score in Line 2. The example in Fig. 5 shows, however, that the greedylike choice of does not necessarily yield a globally optimal edit set.
In order to identify arcs that must be contained in every edit set, we first clarify the relationship between the partitions on and the partitions defined by the phylogenetic trees on .
Lemma 3.
Let be a set with . Let be the set of all partitions of with . Then the set of all phylogenetic trees with leaf set satisfies .
Proof.
For every , is a set of phylogenetic trees on . Hence, we conclude . Conversely, assume that . Since (with root ) is a phylogenetic tree and has at least two leaves, we have . Together with , this implies . In particular, satisfies for some , and is therefore contained in . ∎∎
Using Lemma 3 and given that , we can express the set of relations that are unsatisfiable for every partition as follows
(4) 
i.e., it coincides with the set of relations that are unsatisfiable for every phylogenetic tree, and thus part of every edit set. Note that is trivially empty if . We next show that can be computed without considering the partitions of explicitly.
Theorem 2.
Let be a properly vertexcolored digraph with then
(5) 
Comments
There are no comments yet.