Program Analysis using Abstract Interpretation

So far we've seen that we can prove facts about programs using Hoare logic. For instance, I can show that a sorting program yields a sorted array when the program terminates, regardless of what the array contents were at the beginning of the program. In general, we can statically (i.e., at compile-time) prove that certain facts hold for all dynamic executions of the program. And our language of assertions was powerful enough to allow us to express complex program properties.

The complication with this approach is that it requires coming up with the appropriate loop invariants. Can we come up with a technique that automatically figures out loop invariants?

It turns out that if we model the desired properties using an abstraction that is suitably finite, then it is possible to automatically compute loop invariants and determine whether the properties hold at the end of the program. The key part is to keep the abstraction finite. Once we've established such a finite abstraction, the idea is to compute the desired information by "executing" (or interpreting) the program in the abstract domain; in the case of loops, the execution will yield the desired loop invariant. This approach is referred to as abstract interpretation, or (data-)flow analysis, and is one of the most important techniques in static program analysis.

To introduce this method, consider the problem of computing signs for variables in the program. More precisely, for each variable in the program, I want to determine whether the variable is positive, zero, negative, or otherwise has unknown sign. To model this, I will use the following abstract domain of possible signs:

    AbsInt = { pos, zero, neg, anyint }

In contrast to the concrete domain Int of possible integer values, the abstract domain AbsInt is clearly finite.
Furthermore, I can arrange the elements of AbsInt as shown in the following diagram:

           anyint
          /   |   \
       pos  zero  neg

This diagram shows an ordering between the elements of AbsInt. If two elements are connected by an edge, then the element higher up is a conservative approximation of the other element. We say that the element lower in the diagram is "less than" the element higher up, and write x "lessthan" y to denote that x is less than y. For instance, pos is less than anyint; indeed, saying that a variable is positive is more precise than saying that its sign is unknown.

Note: Technically speaking, such an ordering relation is known as a _partial order_, defined as a relation that is reflexive, anti-symmetric, and transitive; and a pair of a set and a partial order such that every subset has a least upper bound and a greatest lower bound forms a _complete lattice_. Our ordering does not provide a greatest lower bound for every subset (for instance, {pos, neg} has no lower bound at all, since no element lies below both), so it is an upper semi-lattice. Hence, our abstraction is a semi-lattice domain.

When I said earlier that I want my abstraction to be "suitably finite", what I really meant is that my abstract lattice should have finite height (it is okay if my abstract set of values is infinite, as long as the lattice is finite in height). This is what is really important for the analysis.

I can now define an abstract store AbsStore, which maps variables to their signs:

    S in AbsStore = Var -> AbsInt

Before I show how the analysis works, let me first compare this abstraction with the abstraction that assertions provide. Essentially, given an abstract store S, for each variable x in Var:

    S(x) = pos      means   x > 0
    S(x) = zero     means   x = 0
    S(x) = neg      means   x < 0
    S(x) = anyint   means   true

Hence, an abstract store S is a conjunction of terms of the form x > 0, x = 0, x < 0, and true. This is all we can express using our current abstraction, and it is clearly less expressive than what we could do with general assertions.
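To make the ordering and the reading of abstract stores concrete, here is a minimal Python sketch. All names in it (leq, join, describes, store_describes) and the string encoding of signs are assumptions made for illustration; they are not part of the notes. The function join computes the least upper bound mentioned in the note on lattices.

```python
# Sketch only: string constants stand in for the elements of AbsInt.
POS, ZERO, NEG, ANYINT = "pos", "zero", "neg", "anyint"

def leq(x, y):
    """The "lessthan" ordering (taken reflexively): x is at least as precise as y."""
    return x == y or y == ANYINT

def join(x, y):
    """Least upper bound of two signs in the diagram."""
    return x if x == y else ANYINT

def describes(sign, n):
    """True if the concrete integer n satisfies the assertion the sign stands for."""
    if sign == POS:
        return n > 0
    if sign == ZERO:
        return n == 0
    if sign == NEG:
        return n < 0
    return True  # anyint means "true"

def store_describes(abs_store, store):
    """An abstract store is the conjunction of its per-variable assertions."""
    return all(describes(abs_store[x], store[x]) for x in abs_store)
```

For example, store_describes({"x": POS, "y": ANYINT}, {"x": 3, "y": -7}) holds, because x > 0 and the assertion for y is simply true.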
Note that the partial ordering for assertions is implication (=>), and the domain of assertions clearly has infinite height with respect to this ordering. Hence, we trade expressiveness for finiteness.

Now given the above abstraction, the analysis essentially executes the program using the abstract store. In other words, the analysis forgets the concrete values of variables, remembers just their signs, and interprets the program using just that information. I will express this interpretation of the program in the abstract domain using denotations.

For an arithmetic expression, the analysis uses an abstract store to derive a sign for that expression. The analysis tries to derive a precise sign for the expression. However, because it lacks knowledge about the concrete values of variables, in certain cases it must be conservative and say that the sign of the expression is unknown (represented as anyint). I express the analysis of arithmetic expressions using an abstract denotation A'[[a]]:

    A'[[a]] : AbsStore -> AbsInt

                            | pos    if n > 0
    A'[[n]] S = sign(n) =   | zero   if n = 0
                            | neg    if n < 0

    A'[[x]] S = S(x)

    A'[[a1 + a2]] S =
        | pos      if (A'[[a1]] S = pos  /\ A'[[a2]] S in {zero, pos}) \/
        |             (A'[[a1]] S = zero /\ A'[[a2]] S = pos)
        |
        | neg      if (A'[[a1]] S = neg  /\ A'[[a2]] S in {zero, neg}) \/
        |             (A'[[a1]] S = zero /\ A'[[a2]] S = neg)
        |
        | zero     if A'[[a1]] S = zero /\ A'[[a2]] S = zero
        |
        | anyint   otherwise

Note that all of these evaluations use just the information in the abstract store when reasoning about variables. The evaluation has no knowledge about the concrete values of variables. In the last case, the analysis cannot precisely determine the sign of the expression, and conservatively returns anyint. For example, the sum of a pos and a neg operand may be positive, zero, or negative, depending on the concrete values.

Furthermore, we can define a similar evaluation for boolean expressions. Given an expression b and an abstract store S, the analysis must determine a possible truth value for that expression.
The analysis must use an abstract domain for booleans:

    AbsBool = { true, false, anybool }

This domain includes the value anybool, indicating that the truth value of an expression cannot be precisely determined. Again, we can define an ordering between these values and form a lattice domain:

        anybool
        /     \
     true    false

I can then formulate the analysis using an abstract denotation B'[[b]]:

    B'[[b]] : AbsStore -> AbsBool

    B'[[true]] S  = true
    B'[[false]] S = false

The evaluation B'[[a1 < a2]] S must evaluate the signs of subexpressions a1 and a2 (using the above denotation) and try to determine the truth value of the comparison based on those signs. For instance:

    B'[[a1 < a2]] S = true   if (A'[[a1]] S = neg  /\ A'[[a2]] S in {zero, pos}) \/
                                (A'[[a1]] S = zero /\ A'[[a2]] S = pos)

The remaining two cases (false and anybool) are left as an exercise.
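The abstract denotations of this section can be sketched as a small Python interpreter. The tuple encoding of expressions and every name below (sign, abs_add, eval_arith, eval_bool) are assumptions made for illustration only. For a1 < a2 the sketch implements just the true case given in the text; because the other cases are the exercise, it conservatively answers anybool for them, which is sound but less precise than a full solution.

```python
# Sketch only: string constants stand in for the elements of AbsInt and AbsBool.
POS, ZERO, NEG, ANYINT = "pos", "zero", "neg", "anyint"
TRUE, FALSE, ANYBOOL = "true", "false", "anybool"

def sign(n):
    """Sign of an integer constant."""
    if n > 0:
        return POS
    if n == 0:
        return ZERO
    return NEG

def abs_add(s1, s2):
    """Abstract addition, following the case analysis for A'[[a1 + a2]]."""
    if (s1 == POS and s2 in (ZERO, POS)) or (s1 == ZERO and s2 == POS):
        return POS
    if (s1 == NEG and s2 in (ZERO, NEG)) or (s1 == ZERO and s2 == NEG):
        return NEG
    if s1 == ZERO and s2 == ZERO:
        return ZERO
    return ANYINT

def eval_arith(a, S):
    """A'[[a]] S, where a is ("num", n), ("var", x), or ("add", a1, a2)."""
    tag = a[0]
    if tag == "num":
        return sign(a[1])
    if tag == "var":
        return S[a[1]]
    if tag == "add":
        return abs_add(eval_arith(a[1], S), eval_arith(a[2], S))
    raise ValueError("unknown arithmetic expression: %r" % (a,))

def eval_bool(b, S):
    """B'[[b]] S, where b is ("true",), ("false",), or ("lt", a1, a2)."""
    tag = b[0]
    if tag == "true":
        return TRUE
    if tag == "false":
        return FALSE
    if tag == "lt":
        s1, s2 = eval_arith(b[1], S), eval_arith(b[2], S)
        if (s1 == NEG and s2 in (ZERO, POS)) or (s1 == ZERO and s2 == POS):
            return TRUE
        # The false case is left as the exercise; anybool is a safe default.
        return ANYBOOL
    raise ValueError("unknown boolean expression: %r" % (b,))
```

With the abstract store {"x": POS}, the expression x + 1 evaluates to pos, while x + (-1) evaluates to anyint, since a positive value plus a negative value can have any sign.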