Program Analysis using Abstract Interpretation

So far we've seen that we can prove facts about programs using Hoare
logic.  For instance, I can show that a sorting program yields a
sorted array when the program terminates, regardless of what the array
contents were at the beginning of the program. In general, we can
statically (i.e., at compile-time) prove that certain facts hold for
all dynamic executions of the program. And our language of assertions
was powerful enough to allow us to express complex program
properties. The complication with this approach is that it requires
coming up with the appropriate loop invariants.

Can we come up with a technique that automatically figures out loop
invariants? It turns out that if we model the desired properties using
an abstraction that is suitably finite, then it is possible to
automatically compute loop invariants and determine whether the
properties hold at the end of the program. The key part is to keep the
abstraction finite. Once we've established such a finite abstraction,
the idea is to compute the desired information by "executing" (or
interpreting) the program in the abstract domain; in the case of
loops, the execution will yield the desired loop invariant. This
approach is referred to as abstract interpretation, or (data-) flow
analysis, and is one of the most important techniques in static
program analysis.

To introduce this method, consider the problem of computing signs for
variables in the program. More precisely, for each variable in the
program, I want to determine whether the variable is positive, zero,
negative, or otherwise has unknown sign. To model this, I will use the
following abstract domain of possible signs:

   AbsInt = { pos, zero, neg, anyint }

In contrast to the concrete domain Int of possible integer values, the
abstract domain AbsInt is clearly finite. Furthermore, I can arrange
the elements of AbsInt as shown in the following diagram:

                 anyint
                /  |   \
              pos zero neg

This diagram shows an ordering between the elements of AbsInt. If two
elements are connected by an edge, then the element higher up is a
conservative approximation of the other element. We say that the
element lower in the diagram is "less than" the element higher up, and
write x "lessthan" y to denote that x is less than y. For instance, pos
is less than anyint; indeed, saying that a variable is positive is more
precise than saying that its sign is unknown.

Note: Technically speaking, such an ordering relation is known as a
_partial order_, defined as a relation that is reflexive,
anti-symmetric, and transitive; and a pair of a set and a partial
order such that every subset has a least upper bound and a greatest
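lower bound forms a _complete lattice_. Our domain does not provide a
greatest lower bound for every subset (for instance, pos and zero have
no common lower bound), but every pair of elements has a least upper
bound, so it is an upper semi-lattice. Hence, our abstraction is a
semi-lattice domain.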
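
The ordering and the least upper bound on AbsInt can be sketched in a
few lines of Python. This is only an illustration: the string encoding
of the abstract values and the function names leq and join are my own
assumptions, not part of the formal development.

```python
# Hypothetical encoding of AbsInt; the names below are illustrative only.
POS, ZERO, NEG, ANYINT = "pos", "zero", "neg", "anyint"
ABS_INT = {POS, ZERO, NEG, ANYINT}

def leq(x, y):
    """Reflexive ordering: True when y is a conservative approximation of x."""
    return x == y or y == ANYINT

def join(x, y):
    """Least upper bound of x and y in the diagram above."""
    return x if x == y else ANYINT

print(leq(POS, ANYINT))   # pos is more precise than anyint
print(join(POS, NEG))     # the only common upper bound is anyint
```

Because pos, zero, and neg are pairwise incomparable, the join of two
different elements is always anyint in this particular domain.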

When I said earlier that I want my abstraction to be "suitably
finite", what I really meant is that my abstract lattice should have
finite height (it is okay if my abstract set of values is infinite, as
long as the lattice is finite in height). This is what is really
important for the analysis.

I can now define an abstract store AbsStore, which maps variables to
their signs:

    S in AbsStore = Var -> AbsInt

Before I show how the analysis works, let me first compare this
abstraction with the abstraction that assertions provide.
Essentially, given an abstract store S, for each variable x
in Var:

   S(x) = pos    means   x > 0
   S(x) = zero   means   x = 0
   S(x) = neg    means   x < 0
   S(x) = anyint means   true

Hence, an abstract store S is a conjunction of terms of the form x >
0, x = 0, x < 0, and true. This is all we can express using our
current abstraction, and is clearly less expressive than what we could
do with general assertions. Note that the partial ordering for
assertions is implication (=>), and the domain of assertions clearly
has infinite height with respect to this ordering. Hence, we trade
expressiveness for finiteness.
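
To make the reading of an abstract store as a conjunction concrete,
here is a minimal Python sketch. It assumes that stores are
dictionaries from variable names to values; the helper names describes
and satisfies are hypothetical and chosen only for this example.

```python
def describes(sign, n):
    """Does the abstract value `sign` hold of the concrete integer n?"""
    if sign == "pos":
        return n > 0
    if sign == "zero":
        return n == 0
    if sign == "neg":
        return n < 0
    return True  # anyint places no constraint on n

def satisfies(concrete, S):
    """Check the conjunction denoted by the abstract store S."""
    return all(describes(S[x], concrete[x]) for x in S)

S = {"x": "pos", "y": "anyint"}
print(satisfies({"x": 3, "y": -7}, S))  # x > 0 /\ true holds
print(satisfies({"x": 0, "y": -7}, S))  # x > 0 fails
```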

Now given the above abstraction, the analysis essentially executes the
program using the abstract store. In other words, the analysis
forgets the concrete values of variables, remembers just their signs,
and interprets the program using just that information. I will express
this interpretation of the program in the abstract domain using
denotations.

For an arithmetic expression, the analysis uses an abstract store to
derive a sign for that expression. The analysis tries to derive a
precise sign for the expression. However, due to the lack of knowledge
about concrete values of variables, in certain cases it may be
conservative and say that the sign of the expression is unknown
(represented as anyint). I express the analysis of arithmetic
expressions using an abstract denotation A'[[a]]:

   A'[[a]] : AbsStore -> AbsInt

                         | pos  if n > 0
   A'[[n]] S = sign(n) = | zero if n = 0
                         | neg  if n < 0

   A'[[x]] S = S(x)

   A'[[a1 + a2]] S = | pos  if (A'[[a1]] S = pos /\ A'[[a2]] S in {zero, pos}) \/
                     |         (A'[[a1]] S = zero /\ A'[[a2]] S = pos)
                     |
                     | neg  if (A'[[a1]] S = neg /\ A'[[a2]] S in {zero, neg}) \/
                     |         (A'[[a1]] S = zero /\ A'[[a2]] S = neg)
                     |
                     | zero if A'[[a1]] S = zero /\ A'[[a2]] S = zero
                     |
                     | anyint otherwise

Note that all of these evaluations use just the information in the
abstract store when reasoning about variables. The evaluation has no
knowledge about the concrete values of variables. In the last case,
the analysis cannot precisely determine the sign of the expression,
and conservatively returns anyint.
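
The denotation A'[[a]] above can be transcribed almost directly into
Python. The sketch below assumes a hypothetical tuple encoding of
expressions, ("num", n), ("var", x), and ("add", a1, a2), and
represents the abstract store as a dictionary; none of these choices
come from the definitions above.

```python
def sign(n):
    """Abstract a concrete integer to its sign."""
    return "pos" if n > 0 else "zero" if n == 0 else "neg"

def abs_add(s1, s2):
    """Abstract addition on signs, case by case as in A'[[a1 + a2]]."""
    if (s1 == "pos" and s2 in ("zero", "pos")) or (s1 == "zero" and s2 == "pos"):
        return "pos"
    if (s1 == "neg" and s2 in ("zero", "neg")) or (s1 == "zero" and s2 == "neg"):
        return "neg"
    if s1 == "zero" and s2 == "zero":
        return "zero"
    return "anyint"  # e.g. pos + neg: the sign depends on concrete values

def abs_eval(a, S):
    """A'[[a]] S for the assumed tuple encoding of expressions."""
    tag = a[0]
    if tag == "num":
        return sign(a[1])
    if tag == "var":
        return S[a[1]]
    if tag == "add":
        return abs_add(abs_eval(a[1], S), abs_eval(a[2], S))
    raise ValueError("unknown expression: %r" % (a,))

S = {"x": "pos", "y": "neg"}
print(abs_eval(("add", ("var", "x"), ("num", 1)), S))    # pos + pos
print(abs_eval(("add", ("var", "x"), ("var", "y")), S))  # pos + neg
```

The second call illustrates the loss of precision: x + y could be
positive, zero, or negative, so the analysis can only answer anyint.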

Furthermore, we can define a similar evaluation for boolean
expressions. Given an expression b and an abstract store S, the analysis
must determine a possible truth value for that expression. The
analysis must use an abstract domain for booleans:

  AbsBool = {true, false, anybool}

This domain includes the value anybool, indicating that the truth
value of an expression cannot be precisely determined. Again, we
can define an ordering between these values and form a lattice domain:


         anybool
          /   \
        true false

I can then formulate the analysis using an abstract denotation
B'[[b]]:


    B'[[b]] : AbsStore -> AbsBool

    B'[[true]] S = true
    B'[[false]] S = false

The evaluation B'[[a1 < a2]] S must evaluate the signs of
subexpressions a1 and a2 (using the above denotation) and try to
determine the truth value of the comparison based on those signs. For instance:

    B'[[a1 < a2]] S = true if (A'[[a1]] S = neg /\ A'[[a2]] S in {zero, pos}) \/
                              (A'[[a1]] S = zero /\ A'[[a2]] S = pos)

The remaining two cases are left as an exercise.
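
As a starting point, the "true" case of the comparison can be sketched
in Python as follows. The function name abs_lt and the string encoding
of AbsBool are assumptions made for this example. So that the exercise
is not given away, the sketch answers anybool whenever the "true" case
does not apply; this is conservative but less precise than the full
definition.

```python
def abs_lt(s1, s2):
    """Partial sketch of B'[[a1 < a2]] given the signs of a1 and a2."""
    if (s1 == "neg" and s2 in ("zero", "pos")) or (s1 == "zero" and s2 == "pos"):
        return "true"
    # The "false" case is deliberately omitted here (it is the exercise),
    # so every other combination is approximated by anybool.
    return "anybool"

print(abs_lt("neg", "pos"))  # a negative value is below a positive one
print(abs_lt("pos", "pos"))  # depends on the concrete values
```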