Tracking Failure Origins

The question of "Where does this value come from?" is fundamental for debugging. Which earlier variables could possibly have influenced the current erroneous state? And how did their values come to be?

When programmers read code during debugging, they scan it for potential origins of given values. This can be a tedious experience, notably, if the origins spread across multiple separate locations, possibly even in different modules. In this chapter, we thus investigate means to determine such origins automatically – by collecting data and control dependencies during program execution.

from bookutils import YouTubeVideo
YouTubeVideo("sjf3cOR0lcI")

Prerequisites

  • You should have read the Introduction to Debugging.
  • To understand how to compute dependencies automatically (the second half of this chapter), you will need
    • advanced knowledge of Python semantics
    • knowledge on how to instrument and transform code
    • knowledge on how an interpreter works
from bookutils import quiz, next_inputs, print_content
import inspect
import warnings

Synopsis

To use the code provided in this chapter, write

>>> from debuggingbook.Slicer import <identifier>

and then make use of the following features.

This chapter provides a Slicer class to automatically determine and visualize dynamic flows and dependencies. When we say that a variable $x$ depends on a variable $y$ (and that $y$ flows into $x$), we distinguish two kinds of dependencies:

  • Data dependency: $x$ is assigned a value computed from $y$.
  • Control dependency: A statement involving $x$ is executed only because a condition involving $y$ was evaluated, influencing the execution path.

Such dependencies are crucial for debugging, as they allow determininh the origins of individual values (and notably incorrect values).

To determine dynamic dependencies in a function func, use

with Slicer() as slicer:
    <Some call to func()>

and then slicer.graph() or slicer.code() to examine dependencies.

You can also explicitly specify the functions to be instrumented, as in

with Slicer(func, func_1, func_2) as slicer:
    <Some call to func()>

Here is an example. The demo() function computes some number from x:

>>> def demo(x: int) -> int:
>>>     z = x
>>>     while x <= z <= 64:
>>>         z *= 2
>>>     return z

By using with Slicer(), we first instrument demo() and then execute it:

>>> with Slicer() as slicer:
>>>     demo(10)

After execution is complete, you can output slicer to visualize the dependencies and flows as graph. Data dependencies are shown as black solid edges; control dependencies are shown as grey dashed edges. The arrows indicate influence: If $y$ depends on $x$ (and thus $x$ flows into $y$), then we have an arrow $x \rightarrow y$. We see how the parameter x flows into z, which is returned after some computation that is control dependent on a <test> involving z.

>>> slicer

dependencies x_functiondemoat0x10ea4aef0_1 x def demo(x: int) -> int: test_functiondemoat0x10ea4aef0_3 <test> while x <= z <= 64: x_functiondemoat0x10ea4aef0_1->test_functiondemoat0x10ea4aef0_3 z_functiondemoat0x10ea4aef0_2 z z = x x_functiondemoat0x10ea4aef0_1->z_functiondemoat0x10ea4aef0_2 z_functiondemoat0x10ea4aef0_4 z z *= 2 test_functiondemoat0x10ea4aef0_3->z_functiondemoat0x10ea4aef0_4 z_functiondemoat0x10ea4aef0_4->test_functiondemoat0x10ea4aef0_3 z_functiondemoat0x10ea4aef0_4->z_functiondemoat0x10ea4aef0_4 demoreturnvalue_functiondemoat0x10ea4aef0_5 <demo() return value> return z z_functiondemoat0x10ea4aef0_4->demoreturnvalue_functiondemoat0x10ea4aef0_5 z_functiondemoat0x10ea4aef0_2->test_functiondemoat0x10ea4aef0_3 z_functiondemoat0x10ea4aef0_2->z_functiondemoat0x10ea4aef0_4

An alternate representation is slicer.code(), annotating the instrumented source code with (backward) dependencies. Data dependencies are shown with <=, control dependencies with <-; locations (lines) are shown in parentheses.

>>> slicer.code()
*    1 def demo(x: int) -> int:
*    2     z = x  # <= x (1)
*    3     while x <= z <= 64:  # <= z (4), x (1), z (2)
*    4         z *= 2  # <= z (4), z (2); <- <test> (3)
*    5     return z  # <= z (4)

Dependencies can also be retrieved programmatically. The dependencies() method returns a Dependencies object encapsulating the dependency graph.

The method all_vars() returns all variables in the dependency graph. Each variable is encoded as a pair (name, location) where location is a pair (codename, lineno).

>>> slicer.dependencies().all_vars()
{('<demo() return value>', (<function __main__.demo(x: int) -> int>, 5)),
 ('<test>', (<function __main__.demo(x: int) -> int>, 3)),
 ('x', (<function __main__.demo(x: int) -> int>, 1)),
 ('z', (<function __main__.demo(x: int) -> int>, 2)),
 ('z', (<function __main__.demo(x: int) -> int>, 4))}

code() and graph() methods can also be applied on dependencies. The method backward_slice(var) returns a backward slice for the given variable (again given as a pair (name, location)). To retrieve where z in Line 2 came from, use:

>>> _, start_demo = inspect.getsourcelines(demo)
>>> start_demo
1
>>> slicer.dependencies().backward_slice(('z', (demo, start_demo + 1))).graph()

dependencies x_functiondemoat0x10ea4aef0_1 x def demo(x: int) -> int: z_functiondemoat0x10ea4aef0_2 z z = x x_functiondemoat0x10ea4aef0_1->z_functiondemoat0x10ea4aef0_2

Here are the classes defined in this chapter. A Slicer instruments a program, using a DependencyTracker at run time to collect Dependencies.

Slicer Slicer __init__() _repr_mimebundle_() code() dependencies() graph() calls_in_our_with_block() default_items_to_instrument() execute() funcs_in_our_with_block() instrument() our_with_block() parse() restore() transform() transformers() Instrumenter Instrumenter __enter__() __exit__() __init__() instrument() default_items_to_instrument() restore() Slicer->Instrumenter StackInspector StackInspector _generated_function_cache caller_frame() caller_function() caller_globals() caller_locals() caller_location() search_frame() search_func() Instrumenter->StackInspector DependencyTracker DependencyTracker TEST __enter__() __exit__() __init__() arg() call() get() param() ret() set() test() call_generator() check_location() clear_read() dependencies() ignore_location_change() ignore_next_location_change() in_generator() ret_generator() DataTracker DataTracker __enter__() __exit__() __init__() arg() augment() call() get() param() ret() set() test() instrument_call() DependencyTracker->DataTracker DataTracker->StackInspector Dependencies Dependencies FONT_NAME NODE_COLOR __init__() __repr__() __str__() _repr_mimebundle_() all_functions() all_vars() backward_slice() code() graph() _code() _source() add_hierarchy() draw_dependencies() draw_edge() expand_criteria() format_var() id() label() make_graph() repr_dependencies() repr_deps() repr_var() source() tooltip() validate() Dependencies->StackInspector Legend Legend •  public_method() •  private_method() •  overloaded_method() Hover over names to see doc

Dependencies

In the Introduction to debugging, we have seen how faults in a program state propagate to eventually become visible as failures. This induces a debugging strategy called tracking origins:

  1. We start with a single faulty state f – the failure.
  2. We determine f's origins – the parts of earlier states that could have caused the faulty state f.
  3. For each of these origins e, we determine whether they are faulty or not.
  4. For each of the faulty origins, we in turn determine their origins.
  5. If we find a part of the state that is faulty, yet has only correct origins, we have found the defect.

In all generality, a "part of the state" can be anything that can influence the program – some configuration setting, some database content, or the state of a device. Almost always, though, it is through individual variables that a part of the state manifests itself.

The good news is that variables do not take arbitrary values at arbitrary times – instead, they are set and accessed at precise moments in time, as determined by the program's semantics. This allows us to determine their origins by reading program code.

Let us assume you have a piece of code that reads as follows. The middle() function is supposed to return the "middle" number of three values x, y, and z – that is, the one number that neither is the minimum nor the maximum.

def middle(x, y, z):
    if y < z:
        if x < y:
            return y
        elif x < z:
            return y
    else:
        if x > y:
            return y
        elif x > z:
            return x
    return z

In most cases, middle() runs just fine:

m = middle(1, 2, 3)
m
2

In others, however, it returns the wrong value:

m = middle(2, 1, 3)
m
1

This is a typical debugging situation: You see a value that is erroneous; and you want to find out where it came from.

  • In our case, we see that the erroneous value was returned from middle(), so we identify the five return statements in middle() that the value could have come from.
  • The value returned is the value of y, and neither x, y, nor z are altered during the execution of middle(). Hence, it must be one of the three return y statements that is the origin of m. But which one?

For our small example, we can fire up an interactive debugger and simply step through the function; this reveals us the conditions evaluated and the return statement executed.

import Debugger  # minor dependency
with Debugger.Debugger():
    middle(2, 1, 3)
Calling middle(x = 2, y = 1, z = 3)
(debugger) step
2     if y < z:
(debugger) step
3         if x < y:
(debugger) step
5         elif x < z:
(debugger) step
6             return y
(debugger) quit

We now see that it was the second return statement that returned the incorrect value. But why was it executed after all? To this end, we can resort to the middle() source code and have a look at those conditions that caused the return y statement to be executed. Indeed, the conditions y < z, x > y, and finally x < z again are origins of the returned value – and in turn have x, y, and z as origins.

In our above reasoning about origins, we have encountered two kinds of origins:

  • earlier data values (such as the value of y being returned) and
  • earlier control conditions (such as the if conditions governing the return y statement).

The later parts of the state that can be influenced by such origins are said to be dependent on these origins. Speaking of variables, a variable $x$ depends on the value of a variable $y$ (written as $x \leftarrow y$) if a change in $y$ could affect the value of $x$.

We distinguish two kinds of dependencies $x \leftarrow y$, aligned with the two kinds of origins as outlined above:

  • Data dependency: $x$ is assigned a value computed from $y$. In our example, m is data dependent on the return value of middle().
  • Control dependency: A statement involving $x$ is executed only because a condition involving $y$ was evaluated, influencing the execution path. In our example, the value returned by return y is control dependent on the several conditions along its path, which involve x, y, and z.

Let us examine these dependencies in more detail.

Visualizing Dependencies

Note: This is an excursion, diverting away from the main flow of the chapter. Unless you know what you are doing, you are encouraged to skip this part.

To illustrate our examples, we introduce a Dependencies class that captures dependencies between variables at specific locations.

A Class for Dependencies

Dependencies holds two dependency graphs. data holds data dependencies, control holds control dependencies.

Each of the two is organized as a dictionary holding nodes as keys and sets of nodes as values. Each node comes as a tuple

(variable_name, location)

where variable_name is a string and location is a pair

(func, lineno)

denoting a unique location in the code.

This is also reflected in the following type definitions:

Location = Tuple[Callable, int]
Node = Tuple[str, Location]
Dependency = Dict[Node, Set[Node]]

In this chapter, for many purposes, we need to lookup a function's location, source code, or simply definition. The class StackInspector provides a number of convenience functions for this purpose.

from StackInspector import StackInspector

The Dependencies class builds on StackInspector to capture dependencies.

class Dependencies(StackInspector):
    """A dependency graph"""

    def __init__(self, 
                 data: Optional[Dependency] = None,
                 control: Optional[Dependency] = None) -> None:
        """
        Create a dependency graph from `data` and `control`.
        Both `data` and `control` are dictionaries
        holding _nodes_ as keys and sets of nodes as values.
        Each node comes as a tuple (variable_name, location)
        where `variable_name` is a string 
        and `location` is a pair (function, lineno)
        where `function` is a callable and `lineno` is a line number
        denoting a unique location in the code.
        """

        if data is None:
            data = {}
        if control is None:
            control = {}

        self.data = data
        self.control = control

        for var in self.data:
            self.control.setdefault(var, set())
        for var in self.control:
            self.data.setdefault(var, set())

        self.validate()

The validate() method checks for consistency.

class Dependencies(Dependencies):
    def validate(self) -> None:
        """Check dependency structure."""
        assert isinstance(self.data, dict)
        assert isinstance(self.control, dict)

        for node in (self.data.keys()) | set(self.control.keys()):
            var_name, location = node
            assert isinstance(var_name, str)
            func, lineno = location
            assert callable(func)
            assert isinstance(lineno, int)

The source() method returns the source code for a given node.

test_deps = Dependencies()
test_deps.source(('z', (middle, 1)))
'def middle(x, y, z):  # type: ignore'

Drawing Dependencies

Both data and control form a graph between nodes, and can be visualized as such. We use the graphviz package for creating such visualizations.

make_graph() sets the basic graph attributes.

import html
class Dependencies(Dependencies):
    NODE_COLOR = 'peachpuff'
    FONT_NAME = 'Courier'  # 'Fira Mono' may produce warnings in 'dot'

    def make_graph(self,
                   name: str = "dependencies",
                   comment: str = "Dependencies") -> Digraph:
        return Digraph(name=name, comment=comment,
            graph_attr={
            },
            node_attr={
                'style': 'filled',
                'shape': 'box',
                'fillcolor': self.NODE_COLOR,
                'fontname': self.FONT_NAME
            },
            edge_attr={
                'fontname': self.FONT_NAME
            })

graph() returns a graph visualization.

class Dependencies(Dependencies):
    def graph(self, *, mode: str = 'flow') -> Digraph:
        """
        Draw dependencies. `mode` is either
        * `'flow'`: arrows indicate information flow (from A to B); or
        * `'depend'`: arrows indicate dependencies (B depends on A)
        """
        self.validate()

        g = self.make_graph()
        self.draw_dependencies(g, mode)
        self.add_hierarchy(g)
        return g

    def _repr_mimebundle_(self, include: Any = None, exclude: Any = None) -> Any:
        """If the object is output in Jupyter, render dependencies as a SVG graph"""
        return self.graph()._repr_mimebundle_(include, exclude)

The main part of graph drawing takes place in two methods, draw_dependencies() and add_hierarchy().

draw_dependencies() processes through the graph, adding nodes and edges from the dependencies.

class Dependencies(Dependencies):
    def all_vars(self) -> Set[Node]:
        """Return a set of all variables (as `var_name`, `location`) in the dependencies"""
        all_vars = set()
        for var in self.data:
            all_vars.add(var)
            for source in self.data[var]:
                all_vars.add(source)

        for var in self.control:
            all_vars.add(var)
            for source in self.control[var]:
                all_vars.add(source)

        return all_vars
class Dependencies(Dependencies):
    def draw_edge(self, g: Digraph, mode: str,
                  node_from: str, node_to: str, **kwargs: Any) -> None:
        if mode == 'flow':
            g.edge(node_from, node_to, **kwargs)
        elif mode == 'depend':
            g.edge(node_from, node_to, dir="back", **kwargs)
        else:
            raise ValueError("`mode` must be 'flow' or 'depend'")

    def draw_dependencies(self, g: Digraph, mode: str) -> None:
        for var in self.all_vars():
            g.node(self.id(var),
                   label=self.label(var),
                   tooltip=self.tooltip(var))

            if var in self.data:
                for source in self.data[var]:
                    self.draw_edge(g, mode, self.id(source), self.id(var))

            if var in self.control:
                for source in self.control[var]:
                    self.draw_edge(g, mode, self.id(source), self.id(var),
                                   style='dashed', color='grey')

draw_dependencies() makes use of a few helper functions.

class Dependencies(Dependencies):
    def id(self, var: Node) -> str:
        """Return a unique ID for `var`."""
        id = ""
        # Avoid non-identifier characters
        for c in repr(var):
            if c.isalnum() or c == '_':
                id += c
            if c == ':' or c == ',':
                id += '_'
        return id

    def label(self, var: Node) -> str:
        """Render node `var` using HTML style."""
        (name, location) = var
        source = self.source(var)

        title = html.escape(name)
        if name.startswith('<'):
            title = f'<I>{title}</I>'

        label = f'<B>{title}</B>'
        if source:
            label += (f'<FONT POINT-SIZE="9.0"><BR/><BR/>'
                      f'{html.escape(source)}'
                      f'</FONT>')
        label = f'<{label}>'
        return label

    def tooltip(self, var: Node) -> str:
        """Return a tooltip for node `var`."""
        (name, location) = var
        func, lineno = location
        return f"{func.__name__}:{lineno}"

In the second part of graph drawing, add_hierarchy() adds invisible edges to ensure that nodes with lower line numbers are drawn above nodes with higher line numbers.

class Dependencies(Dependencies):
    def add_hierarchy(self, g: Digraph) -> Digraph:
        """Add invisible edges for a proper hierarchy."""
        functions = self.all_functions()
        for func in functions:
            last_var = None
            last_lineno = 0
            for (lineno, var) in functions[func]:
                if last_var is not None and lineno > last_lineno:
                    g.edge(self.id(last_var),
                           self.id(var),
                           style='invis')

                last_var = var
                last_lineno = lineno

        return g
class Dependencies(Dependencies):
    def all_functions(self) -> Dict[Callable, List[Tuple[int, Node]]]:
        """
        Return mapping 
        {`function`: [(`lineno`, `var`), (`lineno`, `var`), ...], ...}
        for all functions in the dependencies.
        """
        functions: Dict[Callable, List[Tuple[int, Node]]] = {}
        for var in self.all_vars():
            (name, location) = var
            func, lineno = location
            if func not in functions:
                functions[func] = []
            functions[func].append((lineno, var))

        for func in functions:
            functions[func].sort()

        return functions

Here comes the graph in all its glory:

def middle_deps() -> Dependencies:
    return Dependencies({('z', (middle, 1)): set(), ('y', (middle, 1)): set(), ('x', (middle, 1)): set(), ('<test>', (middle, 2)): {('y', (middle, 1)), ('z', (middle, 1))}, ('<test>', (middle, 3)): {('y', (middle, 1)), ('x', (middle, 1))}, ('<test>', (middle, 5)): {('z', (middle, 1)), ('x', (middle, 1))}, ('<middle() return value>', (middle, 6)): {('y', (middle, 1))}}, {('z', (middle, 1)): set(), ('y', (middle, 1)): set(), ('x', (middle, 1)): set(), ('<test>', (middle, 2)): set(), ('<test>', (middle, 3)): {('<test>', (middle, 2))}, ('<test>', (middle, 5)): {('<test>', (middle, 3))}, ('<middle() return value>', (middle, 6)): {('<test>', (middle, 5))}})
middle_deps()
dependencies test_functionmiddleat0x10a9cbf40_3 <test> if x < y: test_functionmiddleat0x10a9cbf40_5 <test> elif x < z: test_functionmiddleat0x10a9cbf40_3->test_functionmiddleat0x10a9cbf40_5 y_functionmiddleat0x10a9cbf40_1 y def middle(x, y, z):  # type: ignore y_functionmiddleat0x10a9cbf40_1->test_functionmiddleat0x10a9cbf40_3 test_functionmiddleat0x10a9cbf40_2 <test> if y < z: y_functionmiddleat0x10a9cbf40_1->test_functionmiddleat0x10a9cbf40_2 middlereturnvalue_functionmiddleat0x10a9cbf40_6 <middle() return value> return y y_functionmiddleat0x10a9cbf40_1->middlereturnvalue_functionmiddleat0x10a9cbf40_6 x_functionmiddleat0x10a9cbf40_1 x def middle(x, y, z):  # type: ignore x_functionmiddleat0x10a9cbf40_1->test_functionmiddleat0x10a9cbf40_3 x_functionmiddleat0x10a9cbf40_1->test_functionmiddleat0x10a9cbf40_5 test_functionmiddleat0x10a9cbf40_2->test_functionmiddleat0x10a9cbf40_3 z_functionmiddleat0x10a9cbf40_1 z def middle(x, y, z):  # type: ignore z_functionmiddleat0x10a9cbf40_1->test_functionmiddleat0x10a9cbf40_2 z_functionmiddleat0x10a9cbf40_1->test_functionmiddleat0x10a9cbf40_5 test_functionmiddleat0x10a9cbf40_5->middlereturnvalue_functionmiddleat0x10a9cbf40_6

By default, the arrow direction indicates flows – an arrow AB indicates that information from A flows into B (and thus the state in A causes the state in B). By setting the extra keyword parameter mode to depend instead of flow (default), you can reverse these arrows; then an arrow AB indicates A depends on B.

middle_deps().graph(mode='depend')
dependencies test_functionmiddleat0x10a9cbf40_3 <test> if x < y: test_functionmiddleat0x10a9cbf40_5 <test> elif x < z: test_functionmiddleat0x10a9cbf40_3->test_functionmiddleat0x10a9cbf40_5 y_functionmiddleat0x10a9cbf40_1 y def middle(x, y, z):  # type: ignore y_functionmiddleat0x10a9cbf40_1->test_functionmiddleat0x10a9cbf40_3 test_functionmiddleat0x10a9cbf40_2 <test> if y < z: y_functionmiddleat0x10a9cbf40_1->test_functionmiddleat0x10a9cbf40_2 middlereturnvalue_functionmiddleat0x10a9cbf40_6 <middle() return value> return y y_functionmiddleat0x10a9cbf40_1->middlereturnvalue_functionmiddleat0x10a9cbf40_6 x_functionmiddleat0x10a9cbf40_1 x def middle(x, y, z):  # type: ignore x_functionmiddleat0x10a9cbf40_1->test_functionmiddleat0x10a9cbf40_3 x_functionmiddleat0x10a9cbf40_1->test_functionmiddleat0x10a9cbf40_5 test_functionmiddleat0x10a9cbf40_2->test_functionmiddleat0x10a9cbf40_3 z_functionmiddleat0x10a9cbf40_1 z def middle(x, y, z):  # type: ignore z_functionmiddleat0x10a9cbf40_1->test_functionmiddleat0x10a9cbf40_2 z_functionmiddleat0x10a9cbf40_1->test_functionmiddleat0x10a9cbf40_5 test_functionmiddleat0x10a9cbf40_5->middlereturnvalue_functionmiddleat0x10a9cbf40_6

Slices

The method backward_slice(*critera, mode='cd') returns a subset of dependencies, following dependencies backward from the given slicing criteria criteria. These criteria can be

  • variable names (such as <test>); or
  • (function, lineno) pairs (such as (middle, 3)); or
  • (var_name, (function, lineno)) (such as (x, (middle, 1))) locations.

The extra parameter mode controls which dependencies are to be followed:

  • d = data dependencies
  • c = control dependencies
Criterion = Union[str, Location, Node]
class Dependencies(Dependencies):
    def expand_criteria(self, criteria: List[Criterion]) -> List[Node]:
        """Return list of vars matched by `criteria`."""
        all_vars = []
        for criterion in criteria:
            criterion_var = None
            criterion_func = None
            criterion_lineno = None

            if isinstance(criterion, str):
                criterion_var = criterion
            elif len(criterion) == 2 and callable(criterion[0]):
                criterion_func, criterion_lineno = criterion
            elif len(criterion) == 2 and isinstance(criterion[0], str):
                criterion_var = criterion[0]
                criterion_func, criterion_lineno = criterion[1]
            else:
                raise ValueError("Invalid argument")

            for var in self.all_vars():
                (var_name, location) = var
                func, lineno = location

                name_matches = (criterion_func is None or
                                criterion_func == func or
                                criterion_func.__name__ == func.__name__)

                location_matches = (criterion_lineno is None or
                                    criterion_lineno == lineno)

                var_matches = (criterion_var is None or
                               criterion_var == var_name)

                if name_matches and location_matches and var_matches:
                    all_vars.append(var)

        return all_vars

    def backward_slice(self, *criteria: Criterion, 
                       mode: str = 'cd', depth: int = -1) -> Dependencies:
        """
        Create a backward slice from nodes `criteria`.
        `mode` can contain 'c' (draw control dependencies)
        and 'd' (draw data dependencies) (default: 'cd')
        """
        data = {}
        control = {}
        queue = self.expand_criteria(criteria)
        seen = set()

        while len(queue) > 0 and depth != 0:
            var = queue[0]
            queue = queue[1:]
            seen.add(var)

            if 'd' in mode:
                # Follow data dependencies
                data[var] = self.data[var]
                for next_var in data[var]:
                    if next_var not in seen:
                        queue.append(next_var)
            else:
                data[var] = set()

            if 'c' in mode:
                # Follow control dependencies
                control[var] = self.control[var]
                for next_var in control[var]:
                    if next_var not in seen:
                        queue.append(next_var)
            else:
                control[var] = set()

            depth -= 1

        return Dependencies(data, control)

Data Dependencies

Here is an example of a data dependency in our middle() program. The value y returned by middle() comes from the value y as originally passed as argument. We use arrows $x \leftarrow y$ to indicate that a variable $x$ depends on an earlier variable $y$:

dependencies y_functionmiddleat0x10a9cbf40_1 y def middle(x, y, z):  # type: ignore middlereturnvalue_functionmiddleat0x10a9cbf40_6 <middle() return value> return y y_functionmiddleat0x10a9cbf40_1->middlereturnvalue_functionmiddleat0x10a9cbf40_6

Here, we can see that the value y in the return statement is data dependent on the value of y as passed to middle(). An alternate interpretation of this graph is a data flow: The value of y in the upper node flows into the value of y in the lower node.

Since we consider the values of variables at specific locations in the program, such data dependencies can also be interpreted as dependencies between statements – the above return statement thus is data dependent on the initialization of y in the upper node.

Control Dependencies

Here is an example of a control dependency. The execution of the above return statement is controlled by the earlier test x < z. We use gray dashed lines to indicate control dependencies:

dependencies middlereturnvalue_functionmiddleat0x10a9cbf40_6 <middle() return value> return y test_functionmiddleat0x10a9cbf40_5 <test> elif x < z: test_functionmiddleat0x10a9cbf40_5->middlereturnvalue_functionmiddleat0x10a9cbf40_6

This test in turn is controlled by earlier tests, so the full chain of control dependencies looks like this:

dependencies test_functionmiddleat0x10a9cbf40_3 <test> if x < y: test_functionmiddleat0x10a9cbf40_5 <test> elif x < z: test_functionmiddleat0x10a9cbf40_3->test_functionmiddleat0x10a9cbf40_5 test_functionmiddleat0x10a9cbf40_2 <test> if y < z: test_functionmiddleat0x10a9cbf40_2->test_functionmiddleat0x10a9cbf40_3 middlereturnvalue_functionmiddleat0x10a9cbf40_6 <middle() return value> return y test_functionmiddleat0x10a9cbf40_5->middlereturnvalue_functionmiddleat0x10a9cbf40_6

Dependency Graphs

The above <test> values (and their statements) are in turn also dependent on earlier data, namely the x, y, and z values as originally passed. We can draw all data and control dependencies in a single graph, called a program dependency graph:

dependencies test_functionmiddleat0x10a9cbf40_3 <test> if x < y: test_functionmiddleat0x10a9cbf40_5 <test> elif x < z: test_functionmiddleat0x10a9cbf40_3->test_functionmiddleat0x10a9cbf40_5 y_functionmiddleat0x10a9cbf40_1 y def middle(x, y, z):  # type: ignore y_functionmiddleat0x10a9cbf40_1->test_functionmiddleat0x10a9cbf40_3 test_functionmiddleat0x10a9cbf40_2 <test> if y < z: y_functionmiddleat0x10a9cbf40_1->test_functionmiddleat0x10a9cbf40_2 middlereturnvalue_functionmiddleat0x10a9cbf40_6 <middle() return value> return y y_functionmiddleat0x10a9cbf40_1->middlereturnvalue_functionmiddleat0x10a9cbf40_6 x_functionmiddleat0x10a9cbf40_1 x def middle(x, y, z):  # type: ignore x_functionmiddleat0x10a9cbf40_1->test_functionmiddleat0x10a9cbf40_3 x_functionmiddleat0x10a9cbf40_1->test_functionmiddleat0x10a9cbf40_5 test_functionmiddleat0x10a9cbf40_2->test_functionmiddleat0x10a9cbf40_3 z_functionmiddleat0x10a9cbf40_1 z def middle(x, y, z):  # type: ignore z_functionmiddleat0x10a9cbf40_1->test_functionmiddleat0x10a9cbf40_2 z_functionmiddleat0x10a9cbf40_1->test_functionmiddleat0x10a9cbf40_5 test_functionmiddleat0x10a9cbf40_5->middlereturnvalue_functionmiddleat0x10a9cbf40_6

This graph now gives us an idea on how to proceed to track the origins of the middle() return value at the bottom. Its value can come from any of the origins – namely the initialization of y at the function call, or from the <test> that controls it. This test in turn depends on x and z and their associated statements, which we now can check one after the other.

Note that all these dependencies in the graph are dynamic dependencies – that is, they refer to statements actually evaluated in the run at hand, as well as the decisions made in that very run. There also are static dependency graphs coming from static analysis of the code; but for debugging, dynamic dependencies specific to the failing run are more useful.

Showing Dependencies with Code

While a graph gives us a representation about which possible data and control flows to track, integrating dependencies with actual program code results in a compact representation that is easy to reason about.

Listing Dependencies

To show dependencies as text, we introduce a method format_var() that shows a single node (a variable) as text. By default, a node is referenced as

NAME (FUNCTION:LINENO)

However, within a given function, it makes no sense to re-state the function name again and again, so we have a shorthand

NAME (LINENO)

to state a dependency to variable NAME in line LINENO.

class Dependencies(Dependencies):
    def format_var(self, var: Node, current_func: Optional[Callable] = None) -> str:
        """Return string for `var` in `current_func`."""
        name, location = var
        func, lineno = location
        if current_func and (func == current_func or func.__name__ == current_func.__name__):
            return f"{name} ({lineno})"
        else:
            return f"{name} ({func.__name__}:{lineno})"

format_var() is used extensively in the __str__() string representation of dependencies, listing all nodes and their data (<=) and control (<-) dependencies.

class Dependencies(Dependencies):
    def __str__(self) -> str:
        """Return string representation of dependencies"""
        self.validate()

        out = ""
        for func in self.all_functions():
            code_name = func.__name__

            if out != "":
                out += "\n"
            out += f"{code_name}():\n"

            all_vars = list(set(self.data.keys()) | set(self.control.keys()))
            all_vars.sort(key=lambda var: var[1][1])

            for var in all_vars:
                (name, location) = var
                var_func, var_lineno = location
                var_code_name = var_func.__name__

                if var_code_name != code_name:
                    continue

                all_deps = ""
                for (source, arrow) in [(self.data, "<="), (self.control, "<-")]:
                    deps = ""
                    for data_dep in source[var]:
                        if deps == "":
                            deps = f" {arrow} "
                        else:
                            deps += ", "
                        deps += self.format_var(data_dep, func)

                    if deps != "":
                        if all_deps != "":
                            all_deps += ";"
                        all_deps += deps

                if all_deps == "":
                    continue

                out += ("    " + 
                        self.format_var(var, func) +
                        all_deps + "\n")

        return out

Here is a compact string representation of dependencies. We see how the (last) middle() return value has a data dependency to y in Line 1, and to the <test> in Line 5.

print(middle_deps())
middle():
    <test> (2) <= y (1), z (1)
    <test> (3) <= y (1), x (1); <- <test> (2)
    <test> (5) <= z (1), x (1); <- <test> (3)
    <middle() return value> (6) <= y (1); <- <test> (5)

The __repr__() method shows a raw form of dependencies, useful for creating dependencies from scratch.

class Dependencies(Dependencies):
    def repr_var(self, var: Node) -> str:
        name, location = var
        func, lineno = location
        return f"({repr(name)}, ({func.__name__}, {lineno}))"

    def repr_deps(self, var_set: Set[Node]) -> str:
        if len(var_set) == 0:
            return "set()"

        return ("{" +
                ", ".join(f"{self.repr_var(var)}"
                          for var in var_set) +
                "}")

    def repr_dependencies(self, vars: Dependency) -> str:
        return ("{\n        " +
                ",\n        ".join(
                    f"{self.repr_var(var)}: {self.repr_deps(vars[var])}"
                    for var in vars) +
                "}")

    def __repr__(self) -> str:
        """Represent dependencies as a Python expression"""
        # Useful for saving and restoring values
        return (f"Dependencies(\n" +
                f"    data={self.repr_dependencies(self.data)},\n" +
                f" control={self.repr_dependencies(self.control)})")
print(repr(middle_deps()))
Dependencies(
    data={
        ('z', (middle, 1)): set(),
        ('y', (middle, 1)): set(),
        ('x', (middle, 1)): set(),
        ('<test>', (middle, 2)): {('y', (middle, 1)), ('z', (middle, 1))},
        ('<test>', (middle, 3)): {('y', (middle, 1)), ('x', (middle, 1))},
        ('<test>', (middle, 5)): {('z', (middle, 1)), ('x', (middle, 1))},
        ('<middle() return value>', (middle, 6)): {('y', (middle, 1))}},
 control={
        ('z', (middle, 1)): set(),
        ('y', (middle, 1)): set(),
        ('x', (middle, 1)): set(),
        ('<test>', (middle, 2)): set(),
        ('<test>', (middle, 3)): {('<test>', (middle, 2))},
        ('<test>', (middle, 5)): {('<test>', (middle, 3))},
        ('<middle() return value>', (middle, 6)): {('<test>', (middle, 5))}})

An even more useful representation comes when integrating these dependencies as comments into the code. The method code(item_1, item_2, ...) lists the given (function) items, including their dependencies; code() lists all functions contained in the dependencies.

class Dependencies(Dependencies):
    def code(self, *items: Callable, mode: str = 'cd') -> None:
        """
        List `items` on standard output, including dependencies as comments. 
        If `items` is empty, all included functions are listed.
        `mode` can contain 'c' (draw control dependencies) and 'd' (draw data dependencies)
        (default: 'cd').
        """

        if len(items) == 0:
            items = cast(Tuple[Callable], self.all_functions().keys())

        for i, item in enumerate(items):
            if i > 0:
                print()
            self._code(item, mode)

    def _code(self, item: Callable, mode: str) -> None:
        # The functions in dependencies may be (instrumented) copies
        # of the original function. Find the function with the same name.
        func = item
        for fn in self.all_functions():
            if fn == item or fn.__name__ == item.__name__:
                func = fn
                break

        all_vars = self.all_vars()
        slice_locations = set(location for (name, location) in all_vars)

        source_lines, first_lineno = inspect.getsourcelines(func)

        n = first_lineno
        for line in source_lines:
            line_location = (func, n)
            if line_location in slice_locations:
                prefix = "* "
            else:
                prefix = "  "

            print(f"{prefix}{n:4} ", end="")

            comment = ""
            for (mode_control, source, arrow) in [
                ('d', self.data, '<='),
                ('c', self.control, '<-')
            ]:
                if mode_control not in mode:
                    continue

                deps = ""
                for var in source:
                    name, location = var
                    if location == line_location:
                        for dep_var in source[var]:
                            if deps == "":
                                deps = arrow + " "
                            else:
                                deps += ", "
                            deps += self.format_var(dep_var, item)

                if deps != "":
                    if comment != "":
                        comment += "; "
                    comment += deps

            if comment != "":
                line = line.rstrip() + "  # " + comment

            print_content(line.rstrip(), '.py')
            print()
            n += 1

The following listing shows such an integration. For each executed line (*), we see its data (<=) and control (<-) dependencies, listing the associated variables and line numbers. The comment

# <= y (1); <- <test> (5)

for Line 6, for instance, states that the return value is data dependent on the value of y in Line 1, and control dependent on the test in Line 5.

Again, one can easily follow these dependencies back to track where a value came from (data dependencies) and why a statement was executed (control dependency).

*    1 def middle(x, y, z):  # type: ignore
*    2     if y < z:  # <= y (1), z (1)
*    3         if x < y:  # <= y (1), x (1); <- <test> (2)
     4             return y
*    5         elif x < z:  # <= z (1), x (1); <- <test> (3)
*    6             return y  # <= y (1); <- <test> (5)
     7     else:
     8         if x > y:
     9             return y
    10         elif x > z:
    11             return x
    12     return z

One important aspect of dependencies is that they not only point to specific sources and causes of failures – but that they also rule out parts of program and state as failures.

  • In the above code, Lines 8 and later have no influence on the output, simply because they were not executed.
  • Furthermore, we see that we can start our investigation with Line 6, because that is the last one executed.
  • The data dependencies tell us that no statement has interfered with the value of y between the function call and its return.
  • Hence, the error must be in the conditions or the final return statement.

With this in mind, recall that our original invocation was middle(2, 1, 3). Why and how is the above code wrong?

Quiz

Which of the following middle() code lines should be fixed?





Indeed, from the controlling conditions, we see that y < z, x >= y, and x < z all hold. Hence, y <= x < z holds, and it is x, not y, that should be returned.

Slices

Given a dependency graph for a particular variable, we can identify the subset of the program that could have influenced it – the so-called slice. In the above code listing, these code locations are highlighted with * characters. Only these locations are part of the slice.

Slices are central to debugging for two reasons:

  • First, they rule out those locations of the program that could not have an effect on the failure. Hence, these locations need not be investigated as it comes to searching for the defect. Nor do they need to be considered for a fix, as any change outside the program slice by construction cannot affect the failure.
  • Second, they bring together possible origins that may be scattered across the code. Many dependencies in program code are non-local, with references to functions, classes, and modules defined in other locations, files, or libraries. A slice brings together all those locations in a single whole.

Here is an example of a slice – this time for our well-known remove_html_markup() function from the introduction to debugging:

from Intro_Debugging import remove_html_markup
print_content(inspect.getsource(remove_html_markup), '.py')
def remove_html_markup(s):  # type: ignore
    tag = False
    quote = False
    out = ""

    for c in s:
        assert tag or not quote

        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c

    return out

When we invoke remove_html_markup() as follows...

remove_html_markup('<foo>bar</foo>')
'bar'

... we obtain the following dependencies:

dependencies tag_functionremove_html_markupat0x10a9f8d30_147 tag <remove_html_markup()> test_functionremove_html_markupat0x10a9f8d30_148 <test> <remove_html_markup()> test_functionremove_html_markupat0x10a9f8d30_150 <test> <remove_html_markup()> tag_functionremove_html_markupat0x10a9f8d30_147->test_functionremove_html_markupat0x10a9f8d30_150 test_functionremove_html_markupat0x10a9f8d30_146 <test> <remove_html_markup()> test_functionremove_html_markupat0x10a9f8d30_146->tag_functionremove_html_markupat0x10a9f8d30_147 test_functionremove_html_markupat0x10a9f8d30_146->test_functionremove_html_markupat0x10a9f8d30_148 remove_html_markupreturnvalue_functionremove_html_markupat0x10a9f8d30_153 <remove_html_markup() return value> <remove_html_markup()> test_functionremove_html_markupat0x10a9f8d30_146->remove_html_markupreturnvalue_functionremove_html_markupat0x10a9f8d30_153 quote_functionremove_html_markupat0x10a9f8d30_138 quote <remove_html_markup()> quote_functionremove_html_markupat0x10a9f8d30_138->test_functionremove_html_markupat0x10a9f8d30_146 test_functionremove_html_markupat0x10a9f8d30_144 <test> <remove_html_markup()> quote_functionremove_html_markupat0x10a9f8d30_138->test_functionremove_html_markupat0x10a9f8d30_144 out_functionremove_html_markupat0x10a9f8d30_139 out <remove_html_markup()> c_functionremove_html_markupat0x10a9f8d30_141 c <remove_html_markup()> c_functionremove_html_markupat0x10a9f8d30_141->test_functionremove_html_markupat0x10a9f8d30_146 c_functionremove_html_markupat0x10a9f8d30_141->test_functionremove_html_markupat0x10a9f8d30_144 c_functionremove_html_markupat0x10a9f8d30_141->test_functionremove_html_markupat0x10a9f8d30_148 out_functionremove_html_markupat0x10a9f8d30_151 out <remove_html_markup()> c_functionremove_html_markupat0x10a9f8d30_141->out_functionremove_html_markupat0x10a9f8d30_151 test_functionremove_html_markupat0x10a9f8d30_144->test_functionremove_html_markupat0x10a9f8d30_146 tag_functionremove_html_markupat0x10a9f8d30_145 tag <remove_html_markup()> test_functionremove_html_markupat0x10a9f8d30_144->tag_functionremove_html_markupat0x10a9f8d30_145 out_functionremove_html_markupat0x10a9f8d30_139->out_functionremove_html_markupat0x10a9f8d30_151 s_functionremove_html_markupat0x10a9f8d30_136 s <remove_html_markup()> s_functionremove_html_markupat0x10a9f8d30_136->c_functionremove_html_markupat0x10a9f8d30_141 tag_functionremove_html_markupat0x10a9f8d30_137 tag <remove_html_markup()> test_functionremove_html_markupat0x10a9f8d30_148->test_functionremove_html_markupat0x10a9f8d30_150 test_functionremove_html_markupat0x10a9f8d30_150->out_functionremove_html_markupat0x10a9f8d30_151 tag_functionremove_html_markupat0x10a9f8d30_145->test_functionremove_html_markupat0x10a9f8d30_150 out_functionremove_html_markupat0x10a9f8d30_151->out_functionremove_html_markupat0x10a9f8d30_151 out_functionremove_html_markupat0x10a9f8d30_151->remove_html_markupreturnvalue_functionremove_html_markupat0x10a9f8d30_153

Again, we can read such a graph forward (starting from, say, s) or backward (starting from the return value). Starting forward, we see how the passed string s flows into the for loop, breaking s into individual characters c that are then checked on various occasions, before flowing into the out return value. We also see how the various if conditions are all influenced by c, tag, and quote.

Quiz

Why does the first line tag = False not influence anything?





Which are the locations that set tag to True? To this end, we compute the slice of tag at tag = True:

dependencies quote_functionremove_html_markupat0x10a9f8d30_138 quote <remove_html_markup()> test_functionremove_html_markupat0x10a9f8d30_144 <test> <remove_html_markup()> quote_functionremove_html_markupat0x10a9f8d30_138->test_functionremove_html_markupat0x10a9f8d30_144 c_functionremove_html_markupat0x10a9f8d30_141 c <remove_html_markup()> tag_functionremove_html_markupat0x10a9f8d30_145 tag <remove_html_markup()> test_functionremove_html_markupat0x10a9f8d30_144->tag_functionremove_html_markupat0x10a9f8d30_145 c_functionremove_html_markupat0x10a9f8d30_141->test_functionremove_html_markupat0x10a9f8d30_144 s_functionremove_html_markupat0x10a9f8d30_136 s <remove_html_markup()> s_functionremove_html_markupat0x10a9f8d30_136->c_functionremove_html_markupat0x10a9f8d30_141

We see where the value of tag comes from: from the characters c in s as well as quote, which all cause it to be set. Again, we can combine these dependencies and the listing in a single, compact view. Note, again, that there are no other locations in the code that could possibly have affected tag in our run.

   238 def remove_html_markup(s):  # type: ignore
   239     tag = False
   240     quote = False
   241     out = ""
   242 
   243     for c in s:
   244         assert tag or not quote
   245 
   246         if c == '<' and not quote:
   247             tag = True
   248         elif c == '>' and not quote:
   249             tag = False
   250         elif (c == '"' or c == "'") and tag:
   251             quote = not quote
   252         elif not tag:
   253             out = out + c
   254 
   255     return out

Quiz

How does the slice of tag = True change for a different value of s?




Indeed, our dynamic slices reflect dependencies as they occurred within a single execution. As the execution changes, so do the dependencies.

Tracking Techniques

For the remainder of this chapter, let us investigate means to determine such dependencies automatically – by collecting them during program execution. The idea is that with a single Python call, we can collect the dependencies for some computation, and present them to programmers – as graphs or as code annotations, as shown above.

To track dependencies, for every variable, we need to keep track of its origins – where it obtained its value, and which tests controlled its assignments. There are two ways to do so:

  • Wrapping Data Objects
  • Wrapping Data Accesses

Wrapping Data Objects

One way to track origins is to wrap each value in a class that stores both a value and the origin of the value. If a variable x is initialized to zero in Line 3, for instance, we could store it as

x = (value=0, origin=<Line 3>)

and if it is copied in, say, Line 5 to another variable y, we could store this as

y = (value=0, origin=<Line 3, Line 5>)

Such a scheme would allow us to track origins and dependencies right within the variable.

In a language like Python, it is actually possibly to subclass from basic types. Here's how we create a MyInt subclass of int:

class MyInt(int):
    def __new__(cls: Type, value: Any, *args: Any, **kwargs: Any) -> Any:
        return super(cls, cls).__new__(cls, value)

    def __repr__(self) -> str:
        return f"{int(self)}"
n: MyInt = MyInt(5)

We can access n just like any integer:

n, n + 1
(5, 6)

However, we can also add extra attributes to it:

n.origin = "Line 5"
n.origin
'Line 5'

Such a "wrapping" scheme has the advantage of leaving program code untouched – simply pass "wrapped" objects instead of the original values. However, it also has a number of drawbacks.

  • First, we must make sure that the "wrapper" objects are still compatible with the original values – notably by converting them back whenever needed. (What happens if an internal Python function expects an int and gets a MyInt instead?)
  • Second, we have to make sure that origins do not get lost during computations – which involves overloading operators such as +, -, *, and so on. (Right now, MyInt(1) + 1 gives us an int object, not a MyInt.)
  • Third, we have to do this for all data types of a language, which is pretty tedious.
  • Fourth and last, however, we want to track whenever a value is assigned to another variable. Python has no support for this, and thus our dependencies will necessarily be incomplete.

Wrapping Data Accesses

An alternate way of tracking origins is to instrument the source code such that all data read and write operations are tracked. That is, the original data stays unchanged, but we change the code instead.

In essence, for every occurrence of a variable x being read, we replace it with

_data.get('x', x)  # returns x

and for every occurrence of a value being written to x, we replace the value with

_data.set('x', value)  # returns value

and let the _data object track these reads and writes.

Hence, an assignment such as

a = b + c

would get rewritten to

a = _data.set('a', _data.get('b', b) + _data.get('c', c))

and with every access to _data, we would track

  1. the current location in the code, and
  2. whether the respective variable was read or written.

For the above statement, we could deduce that b and c were read, and a was written – which makes a data dependent on b and c.

The advantage of such instrumentation is that it works with arbitrary objects (in Python, that is) – we do not care whether a, b, and c would be integers, floats, strings, lists or any other type for which + would be defined. Also, the code semantics remain entirely unchanged.

The disadvantage, however, is that it takes a bit of effort to exactly separate reads and writes into individual groups, and that a number of language features have to be handled separately. This is what we do in the remainder of this chapter.

A Data Tracker

To implement _data accesses as shown above, we introduce the DataTracker class. As its name suggests, it keeps track of variables being read and written, and provides methods to determine the code location where this took place.

class DataTracker(StackInspector):
    """Track data accesses during execution"""

    def __init__(self, log: bool = False) -> None:
        """Constructor. If `log` is set, turn on logging."""
        self.log = log

set() is invoked when a variable is set, as in

pi = _data.set('pi', 3.1415)

By default, we simply log the access using name and value. (loads will be used later.)

class DataTracker(DataTracker):
    def set(self, name: str, value: Any, loads: Optional[Set[str]] = None) -> Any:
        """Track setting `name` to `value`."""
        if self.log:
            caller_func, lineno = self.caller_location()
            print(f"{caller_func.__name__}:{lineno}: setting {name}")

        return value

get() is invoked when a variable is retrieved, as in

print(_data.get('pi', pi))

By default, we simply log the access.

class DataTracker(DataTracker):
    def get(self, name: str, value: Any) -> Any:
        """Track getting `value` from `name`."""

        if self.log:
            caller_func, lineno = self.caller_location()
            print(f"{caller_func.__name__}:{lineno}: getting {name}")

        return value

Here's an example of a logging DataTracker:

_test_data = DataTracker(log=True)
x = _test_data.set('x', 1)
<cell line: 2>:2: setting x
_test_data.get('x', x)
<cell line: 1>:1: getting x
1

Instrumenting Source Code

How do we transform source code such that read and write accesses to variables would be automatically rewritten? To this end, we inspect the internal representation of source code, namely the abstract syntax trees (ASTs). An AST represents the code as a tree, with specific node types for each syntactical element.

import ast
from bookutils import show_ast

Here is the tree representation for our middle() function. It starts with a FunctionDef node at the top (with the name "middle" and the three arguments x, y, z as children), followed by a subtree for each of the If statements, each of which contains a branch for when their condition evaluates to True and a branch for when their condition evaluates to False.

middle_tree = ast.parse(inspect.getsource(middle))
show_ast(middle_tree)
0 FunctionDef 1 "middle" 0--1 2 arguments 0--2 9 If 0--9 70 Return 0--70 3 arg 2--3 5 arg 2--5 7 arg 2--7 4 "x" 3--4 6 "y" 5--6 8 "z" 7--8 10 Compare 9--10 18 If 9--18 44 If 9--44 11 Name 10--11 14 Lt 10--14 15 Name 10--15 12 "y" 11--12 13 Load 11--13 16 "z" 15--16 17 Load 15--17 19 Compare 18--19 27 Return 18--27 31 If 18--31 20 Name 19--20 23 Lt 19--23 24 Name 19--24 21 "x" 20--21 22 Load 20--22 25 "y" 24--25 26 Load 24--26 28 Name 27--28 29 "y" 28--29 30 Load 28--30 32 Compare 31--32 40 Return 31--40 33 Name 32--33 36 Lt 32--36 37 Name 32--37 34 "x" 33--34 35 Load 33--35 38 "z" 37--38 39 Load 37--39 41 Name 40--41 42 "y" 41--42 43 Load 41--43 45 Compare 44--45 53 Return 44--53 57 If 44--57 46 Name 45--46 49 Gt 45--49 50 Name 45--50 47 "x" 46--47 48 Load 46--48 51 "y" 50--51 52 Load 50--52 54 Name 53--54 55 "y" 54--55 56 Load 54--56 58 Compare 57--58 66 Return 57--66 59 Name 58--59 62 Gt 58--62 63 Name 58--63 60 "x" 59--60 61 Load 59--61 64 "z" 63--64 65 Load 63--65 67 Name 66--67 68 "x" 67--68 69 Load 67--69 71 Name 70--71 72 "z" 71--72 73 Load 71--73

At the very bottom of the tree, you can see a number of Name nodes, referring individual variables. These are the ones we want to transform.

Tracking Variable Accesses

Our goal is to traverse the tree, identify all Name nodes, and convert them to respective _data accesses. To this end, we manipulate the AST through the ast Python module ast. The official Python ast reference is complete, but a bit brief; the documentation "Green Tree Snakes - the missing Python AST docs" provides an excellent introduction.

The Python ast module provides a class NodeTransformer that allows such transformations. By subclassing from it, we provide a method visit_Name() that will be invoked for all Name nodes – and replace it by a new subtree from make_get_data():

from ast import NodeTransformer, NodeVisitor, Name, AST
import typing
DATA_TRACKER = '_data'
def is_internal(id: str) -> bool:
    """Return True if `id` is a built-in function or type"""
    return (id in dir(__builtins__) or id in dir(typing))
assert is_internal('int')
assert is_internal('None')
assert is_internal('Tuple')
class TrackGetTransformer(NodeTransformer):
    def visit_Name(self, node: Name) -> AST:
        self.generic_visit(node)

        if is_internal(node.id):
            # Do not change built-in names and types
            return node

        if node.id == DATA_TRACKER:
            # Do not change own accesses
            return node

        if not isinstance(node.ctx, Load):
            # Only change loads (not stores, not deletions)
            return node

        new_node = make_get_data(node.id)
        ast.copy_location(new_node, node)
        return new_node

Our function make_get_data(id, method) returns a new subtree equivalent to the Python code _data.method('id', id).

from ast import Module, Load, Store, \
    Attribute, With, withitem, keyword, Call, Expr, \
    Assign, AugAssign, AnnAssign, Assert
# Starting with Python 3.8, these will become Constant.
# from ast import Num, Str, NameConstant
# Use `ast.Num`, `ast.Str`, and `ast.NameConstant` for compatibility
def make_get_data(id: str, method: str = 'get') -> Call:
    return Call(func=Attribute(value=Name(id=DATA_TRACKER, ctx=Load()), 
                               attr=method, ctx=Load()),
                args=[ast.Str(s=id), Name(id=id, ctx=Load())],
                keywords=[])

This is the tree that make_get_data() produces:

show_ast(Module(body=[make_get_data("x")]))
0 Call 1 Attribute 0--1 7 Constant 0--7 9 Name 0--9 2 Name 1--2 5 "get" 1--5 6 Load 1--6 3 "_data" 2--3 4 Load 2--4 8 "x" 7--8 10 "x" 9--10 11 Load 9--11

How do we know that this is a correct subtree? We can carefully read the official Python ast reference and then proceed by trial and error (and apply delta debugging to determine error causes). Or – pro tip! – we can simply take a piece of Python code, parse it and use ast.dump() to print out how to construct the resulting AST:

print(ast.dump(ast.parse("_data.get('x', x)")))
Module(body=[Expr(value=Call(func=Attribute(value=Name(id='_data', ctx=Load()), attr='get', ctx=Load()), args=[Constant(value='x'), Name(id='x', ctx=Load())], keywords=[]))], type_ignores=[])

If you compare the above output with the code of make_get_data(), above, you will find out where the source of make_get_data() comes from.

Let us put TrackGetTransformer to action. Its visit() method calls visit_Name(), which then in turn transforms the Name nodes as we want it. This happens in place.

TrackGetTransformer().visit(middle_tree);

To see the effect of our transformations, we introduce a method dump_tree() which outputs the tree – and also compiles it to check for any inconsistencies.

def dump_tree(tree: AST) -> None:
    print_content(ast.unparse(tree), '.py')
    ast.fix_missing_locations(tree)  # Must run this before compiling
    _ = compile(cast(ast.Module, tree), '<dump_tree>', 'exec')

We see that our transformer has properly replaced all variable accesses:

dump_tree(middle_tree)
def middle(x, y, z):
    if _data.get('y', y) < _data.get('z', z):
        if _data.get('x', x) < _data.get('y', y):
            return _data.get('y', y)
        elif _data.get('x', x) < _data.get('z', z):
            return _data.get('y', y)
    elif _data.get('x', x) > _data.get('y', y):
        return _data.get('y', y)
    elif _data.get('x', x) > _data.get('z', z):
        return _data.get('x', x)
    return _data.get('z', z)

Let us now execute this code together with the DataTracker() class we previously introduced. The class DataTrackerTester() takes a (transformed) tree and a function. Using it as

with DataTrackerTester(tree, func):
    func(...)

first executes the code in tree (possibly instrumenting func) and then the with body. At the end, func is restored to its previous (non-instrumented) version.

from types import TracebackType
class DataTrackerTester:
    def __init__(self, tree: AST, func: Callable, log: bool = True) -> None:
        """Constructor. Execute the code in `tree` while instrumenting `func`."""
        # We pass the source file of `func` such that we can retrieve it
        # when accessing the location of the new compiled code
        source = cast(str, inspect.getsourcefile(func))
        self.code = compile(cast(ast.Module, tree), source, 'exec')
        self.func = func
        self.log = log

    def make_data_tracker(self) -> Any:
        return DataTracker(log=self.log)

    def __enter__(self) -> Any:
        """Rewrite function"""
        tracker = self.make_data_tracker()
        globals()[DATA_TRACKER] = tracker
        exec(self.code, globals())
        return tracker

    def __exit__(self, exc_type: Type, exc_value: BaseException,
                 traceback: TracebackType) -> Optional[bool]:
        """Restore function"""
        globals()[self.func.__name__] = self.func
        del globals()[DATA_TRACKER]
        return None

Here is our middle() function:

print_content(inspect.getsource(middle), '.py', start_line_number=1)
 1  def middle(x, y, z):  # type: ignore
 2      if y < z:
 3          if x < y:
 4              return y
 5          elif x < z:
 6              return y
 7      else:
 8          if x > y:
 9              return y
10          elif x > z:
11              return x
12      return z

And here is our instrumented middle_tree executed with a DataTracker object. We see how the middle() tests access one argument after another.

with DataTrackerTester(middle_tree, middle):
    middle(2, 1, 3)
middle:2: getting y
middle:2: getting z
middle:3: getting x
middle:3: getting y
middle:5: getting x
middle:5: getting z
middle:6: getting y

After DataTrackerTester is done, middle is reverted to its non-instrumented version:

middle(2, 1, 3)
1

For a complete picture of what happens during executions, we implement a number of additional code transformers.

For each assignment statement x = y, we change it to x = _data.set('x', y). This allows tracking assignments.

Tracking Assignments and Assertions

For the remaining transformers, we follow the same steps as for TrackGetTransformer, except that our visit_...() methods focus on different nodes, and return different subtrees. Here, we focus on assignment nodes.

We want to transform assignments x = value into _data.set('x', value) to track assignments to x.

If the left-hand side of the assignment is more complex, as in x[y] = value, we want to ensure the read access to x and y is also tracked. By transforming x[y] = value into _data.set('x', value, loads=(x, y)), we ensure that x and y are marked as read (as the otherwise ignored loads argument would be changed to _data.get() calls for x and y).

Using ast.dump(), we reveal what the corresponding syntax tree has to look like:

print(ast.dump(ast.parse("_data.set('x', value, loads=(a, b))")))
Module(body=[Expr(value=Call(func=Attribute(value=Name(id='_data', ctx=Load()), attr='set', ctx=Load()), args=[Constant(value='x'), Name(id='value', ctx=Load())], keywords=[keyword(arg='loads', value=Tuple(elts=[Name(id='a', ctx=Load()), Name(id='b', ctx=Load())], ctx=Load()))]))], type_ignores=[])

Using this structure, we can write a function make_set_data() which constructs such a subtree.

def make_set_data(id: str, value: Any, 
                  loads: Optional[Set[str]] = None, method: str = 'set') -> Call:
    """
    Construct a subtree _data.`method`('`id`', `value`). 
    If `loads` is set to [X1, X2, ...], make it
    _data.`method`('`id`', `value`, loads=(X1, X2, ...))
    """

    keywords=[]

    if loads:
        keywords = [
            keyword(arg='loads',
                    value=ast.Tuple(
                        elts=[Name(id=load, ctx=Load()) for load in loads],
                        ctx=Load()
                    ))
        ]

    new_node = Call(func=Attribute(value=Name(id=DATA_TRACKER, ctx=Load()),
                                   attr=method, ctx=Load()),
                    args=[ast.Str(s=id), value],
                    keywords=keywords)

    ast.copy_location(new_node, value)

    return new_node

The problem is, however: How do we get the name of the variable being assigned to? The left-hand side of an assignment can be a complex expression such as x[i]. We use the leftmost name of the left-hand side as name to be assigned to.

class LeftmostNameVisitor(NodeVisitor):
    def __init__(self) -> None:
        super().__init__()
        self.leftmost_name: Optional[str] = None

    def visit_Name(self, node: Name) -> None:
        if self.leftmost_name is None:
            self.leftmost_name = node.id
        self.generic_visit(node)
def leftmost_name(tree: AST) -> Optional[str]:
    visitor = LeftmostNameVisitor()
    visitor.visit(tree)
    return visitor.leftmost_name
leftmost_name(ast.parse('a[x] = 25'))
'a'

Python also allows tuple assignments, as in (a, b, c) = (1, 2, 3). We extract all variables being stored (that is, expressions whose ctx attribute is Store()) and extract their (leftmost) names.

class StoreVisitor(NodeVisitor):
    def __init__(self) -> None:
        super().__init__()
        self.names: Set[str] = set()

    def visit(self, node: AST) -> None:
        if hasattr(node, 'ctx') and isinstance(node.ctx, Store):
            name = leftmost_name(node)
            if name:
                self.names.add(name)

        self.generic_visit(node)
def store_names(tree: AST) -> Set[str]:
    visitor = StoreVisitor()
    visitor.visit(tree)
    return visitor.names
store_names(ast.parse('a[x], b[y], c = 1, 2, 3'))
{'a', 'b', 'c'}

For complex assignments, we also want to access the names read in the left hand side of an expression.

class LoadVisitor(NodeVisitor):
    def __init__(self) -> None:
        super().__init__()
        self.names: Set[str] = set()

    def visit(self, node: AST) -> None:
        if hasattr(node, 'ctx') and isinstance(node.ctx, Load):
            name = leftmost_name(node)
            if name is not None:
                self.names.add(name)

        self.generic_visit(node)
def load_names(tree: AST) -> Set[str]:
    visitor = LoadVisitor()
    visitor.visit(tree)
    return visitor.names
load_names(ast.parse('a[x], b[y], c = 1, 2, 3'))
{'a', 'b', 'x', 'y'}

With this, we can now define TrackSetTransformer as a transformer for regular assignments. Note that in Python, an assignment can have multiple targets, as in a = b = c; we assign the data dependencies of c to them all.

class TrackSetTransformer(NodeTransformer):
    def visit_Assign(self, node: Assign) -> Assign:
        value = ast.unparse(node.value)
        if value.startswith(DATA_TRACKER + '.set'):
            return node  # Do not apply twice

        for target in node.targets:
            loads = load_names(target)
            for store_name in store_names(target):
                node.value = make_set_data(store_name, node.value, 
                                           loads=loads)
                loads = set()

        return node

The special form of "augmented assign" needs special treatment. We change statements of the form x += y to x += _data.augment('x', y).

class TrackSetTransformer(TrackSetTransformer):
    def visit_AugAssign(self, node: AugAssign) -> AugAssign:
        value = ast.unparse(node.value)
        if value.startswith(DATA_TRACKER):
            return node  # Do not apply twice

        id = cast(str, leftmost_name(node.target))
        node.value = make_set_data(id, node.value, method='augment')

        return node

The corresponding augment() method uses a combination of set() and get() to reflect the semantics.

class DataTracker(DataTracker):
    def augment(self, name: str, value: Any) -> Any:
        """
        Track augmenting `name` with `value`.
        To be overloaded in subclasses.
        """
        self.set(name, self.get(name, value))
        return value

"Annotated" assignments come with a type, as in x: int = 3. We treat them like regular assignments.

class TrackSetTransformer(TrackSetTransformer):
    def visit_AnnAssign(self, node: AnnAssign) -> AnnAssign:
        if node.value is None:
            return node  # just <var>: <type> without value

        value = ast.unparse(node.value)
        if value.startswith(DATA_TRACKER + '.set'):
            return node  # Do not apply twice

        loads = load_names(node.target)
        for store_name in store_names(node.target):
            node.value = make_set_data(store_name, node.value, 
                                       loads=loads)
            loads = set()

        return node

Finally, we treat assertions just as if they were assignments to special variables:

class TrackSetTransformer(TrackSetTransformer):
    def visit_Assert(self, node: Assert) -> Assert:
        value = ast.unparse(node.test)
        if value.startswith(DATA_TRACKER + '.set'):
            return node  # Do not apply twice

        loads = load_names(node.test)
        node.test = make_set_data("<assertion>", node.test, loads=loads)
        return node

Here's all of these transformers in action. Our original function has a number of assignments:

def assign_test(x: int) -> Tuple[int, str]:
    forty_two: int = 42
    forty_two = forty_two = 42
    a, b = 1, 2
    c = d = [a, b]
    c[d[a]].attr = 47
    a *= b + 1
    assert a > 0
    return forty_two, "Forty-Two"
assign_tree = ast.parse(inspect.getsource(assign_test))
TrackSetTransformer().visit(assign_tree)
dump_tree(assign_tree)
def assign_test(x: int) -> Tuple[int, str]:
    forty_two: int = _data.set('forty_two', 42)
    forty_two = forty_two = _data.set('forty_two', _data.set('forty_two', 42))
    (a, b) = _data.set('a', _data.set('b', (1, 2)))
    c = d = _data.set('d', _data.set('c', [a, b]))
    c[d[a]].attr = _data.set('c', 47, loads=(c, a, d))
    a *= _data.augment('a', b + 1)
    assert _data.set('<assertion>', a > 0, loads=(a,))
    return (forty_two, 'Forty-Two')

If we later apply our transformer for data accesses, we can see that we track all variable reads and writes.

TrackGetTransformer().visit(assign_tree)
dump_tree(assign_tree)
def assign_test(x: int) -> Tuple[int, str]:
    forty_two: int = _data.set('forty_two', 42)
    forty_two = forty_two = _data.set('forty_two', _data.set('forty_two', 42))
    (a, b) = _data.set('a', _data.set('b', (1, 2)))
    c = d = _data.set('d', _data.set('c', [_data.get('a', a), _data.get('b', b)]))
    _data.get('c', c)[_data.get('d', d)[_data.get('a', a)]].attr = _data.set('c', 47, loads=(_data.get('c', c), _data.get('a', a), _data.get('d', d)))
    a *= _data.augment('a', _data.get('b', b) + 1)
    assert _data.set('<assertion>', _data.get('a', a) > 0, loads=(_data.get('a', a),))
    return (_data.get('forty_two', forty_two), 'Forty-Two')

Each return statement return x is transformed to return _data.set('<return_value>', x). This allows tracking return values.

Tracking Return Values

Our TrackReturnTransformer also makes use of make_set_data().

class TrackReturnTransformer(NodeTransformer):
    def __init__(self) -> None:
        self.function_name: Optional[str] = None
        super().__init__()

    def visit_FunctionDef(self, node: Union[ast.FunctionDef, ast.AsyncFunctionDef]) -> AST:
        outer_name = self.function_name
        self.function_name = node.name  # Save current name
        self.generic_visit(node)
        self.function_name = outer_name
        return node

    def visit_AsyncFunctionDef(self, node: ast.AsyncFunctionDef) -> AST:
        return self.visit_FunctionDef(node)

    def return_value(self, tp: str = "return") -> str:
        if self.function_name is None:
            return f"<{tp} value>"
        else:
            return f"<{self.function_name}() {tp} value>"

    def visit_return_or_yield(self, node: Union[ast.Return, ast.Yield, ast.YieldFrom],
                              tp: str = "return") -> AST:

        if node.value is not None:
            value = ast.unparse(node.value)
            if not value.startswith(DATA_TRACKER + '.set'):
                node.value = make_set_data(self.return_value(tp), node.value)

        return node

    def visit_Return(self, node: ast.Return) -> AST:
        return self.visit_return_or_yield(node, tp="return")

    def visit_Yield(self, node: ast.Yield) -> AST:
        return self.visit_return_or_yield(node, tp="yield")

    def visit_YieldFrom(self, node: ast.YieldFrom) -> AST:
        return self.visit_return_or_yield(node, tp="yield")

This is the effect of TrackReturnTransformer. We see that all return values are saved, and thus all locations of the corresponding return statements are tracked.

TrackReturnTransformer().visit(middle_tree)
dump_tree(middle_tree)
def middle(x, y, z):
    if _data.get('y', y) < _data.get('z', z):
        if _data.get('x', x) < _data.get('y', y):
            return _data.set('<middle() return value>', _data.get('y', y))
        elif _data.get('x', x) < _data.get('z', z):
            return _data.set('<middle() return value>', _data.get('y', y))
    elif _data.get('x', x) > _data.get('y', y):
        return _data.set('<middle() return value>', _data.get('y', y))
    elif _data.get('x', x) > _data.get('z', z):
        return _data.set('<middle() return value>', _data.get('x', x))
    return _data.set('<middle() return value>', _data.get('z', z))
with DataTrackerTester(middle_tree, middle):
    middle(2, 1, 3)
middle:2: getting y
middle:2: getting z
middle:3: getting x
middle:3: getting y
middle:5: getting x
middle:5: getting z
middle:6: getting y
middle:6: setting <middle() return value>

To track control dependencies, for every block controlled by an if, while, or for:

  1. We wrap their tests in a _data.test() wrapper. This allows us to assign pseudo-variables like <test> which hold the conditions.
  2. We wrap their controlled blocks in a with statement. This allows us to track the variables read right before the with (= the controlling variables), and to restore the current controlling variables when the block is left.

A statement

if cond:
    body

thus becomes

if _data.test(cond):
    with _data:
        body
Tracking Control

To modify control statements, we traverse the tree, looking for If nodes:

class TrackControlTransformer(NodeTransformer):
    def visit_If(self, node: ast.If) -> ast.If:
        self.generic_visit(node)
        node.test = self.make_test(node.test)
        node.body = self.make_with(node.body)
        node.orelse = self.make_with(node.orelse)
        return node

The subtrees come from helper functions make_with() and make_test(). Again, all these subtrees are obtained via ast.dump().

class TrackControlTransformer(TrackControlTransformer):
    def make_with(self, block: List[ast.stmt]) -> List[ast.stmt]:
        """Create a subtree 'with _data: `block`'"""
        if len(block) == 0:
            return []

        block_as_text = ast.unparse(block[0])
        if block_as_text.startswith('with ' + DATA_TRACKER):
            return block  # Do not apply twice

        new_node = With(
            items=[
                withitem(
                    context_expr=Name(id=DATA_TRACKER, ctx=Load()),
                    optional_vars=None)
            ],
            body=block
        )
        ast.copy_location(new_node, block[0])
        return [new_node]
class TrackControlTransformer(TrackControlTransformer):
    def make_test(self, test: ast.expr) -> ast.expr:
        test_as_text = ast.unparse(test)
        if test_as_text.startswith(DATA_TRACKER + '.test'):
            return test  # Do not apply twice

        new_test = Call(func=Attribute(value=Name(id=DATA_TRACKER, ctx=Load()),
                                       attr='test',
                                       ctx=Load()),
                         args=[test],
                         keywords=[])
        ast.copy_location(new_test, test)
        return new_test

while loops are handled just like if constructs.

class TrackControlTransformer(TrackControlTransformer):
    def visit_While(self, node: ast.While) -> ast.While:
        self.generic_visit(node)
        node.test = self.make_test(node.test)
        node.body = self.make_with(node.body)
        node.orelse = self.make_with(node.orelse)
        return node

for loops gets a different treatment, as there is no condition that would control the body. Still, we ensure that setting the iterator variable is properly tracked.

class TrackControlTransformer(TrackControlTransformer):
    # regular `for` loop
    def visit_For(self, node: Union[ast.For, ast.AsyncFor]) -> AST:
        self.generic_visit(node)
        id = ast.unparse(node.target).strip()
        node.iter = make_set_data(id, node.iter)

        # Uncomment if you want iterators to control their bodies
        # node.body = self.make_with(node.body)
        # node.orelse = self.make_with(node.orelse)
        return node

    # `for` loops in async functions
    def visit_AsyncFor(self, node: ast.AsyncFor) -> AST:
        return self.visit_For(node)

    # `for` clause in comprehensions
    def visit_comprehension(self, node: ast.comprehension) -> AST:
        self.generic_visit(node)
        id = ast.unparse(node.target).strip()
        node.iter = make_set_data(id, node.iter)
        return node

Here is the effect of TrackControlTransformer:

TrackControlTransformer().visit(middle_tree)
dump_tree(middle_tree)
def middle(x, y, z):
    if _data.test(_data.get('y', y) < _data.get('z', z)):
        with _data:
            if _data.test(_data.get('x', x) < _data.get('y', y)):
                with _data:
                    return _data.set('<middle() return value>', _data.get('y', y))
            else:
                with _data:
                    if _data.test(_data.get('x', x) < _data.get('z', z)):
                        with _data:
                            return _data.set('<middle() return value>', _data.get('y', y))
    else:
        with _data:
            if _data.test(_data.get('x', x) > _data.get('y', y)):
                with _data:
                    return _data.set('<middle() return value>', _data.get('y', y))
            else:
                with _data:
                    if _data.test(_data.get('x', x) > _data.get('z', z)):
                        with _data:
                            return _data.set('<middle() return value>', _data.get('x', x))
    return _data.set('<middle() return value>', _data.get('z', z))

We extend DataTracker to also log these events:

class DataTracker(DataTracker):
    def test(self, cond: AST) -> AST:
        """Test condition `cond`. To be overloaded in subclasses."""
        if self.log:
            caller_func, lineno = self.caller_location()
            print(f"{caller_func.__name__}:{lineno}: testing condition")

        return cond
class DataTracker(DataTracker):
    def __enter__(self) -> Any:
        """Enter `with` block. To be overloaded in subclasses."""
        if self.log:
            caller_func, lineno = self.caller_location()
            print(f"{caller_func.__name__}:{lineno}: entering block")
        return self

    def __exit__(self, exc_type: Type, exc_value: BaseException, 
                 traceback: TracebackType) -> Optional[bool]:
        """Exit `with` block. To be overloaded in subclasses."""
        if self.log:
            caller_func, lineno = self.caller_location()
            print(f"{caller_func.__name__}:{lineno}: exiting block")
        return None
with DataTrackerTester(middle_tree, middle):
    middle(2, 1, 3)
middle:2: getting y
middle:2: getting z
middle:2: testing condition
middle:3: entering block
middle:3: getting x
middle:3: getting y
middle:3: testing condition
middle:5: entering block
middle:5: getting x
middle:5: getting z
middle:5: testing condition
middle:6: entering block
middle:6: getting y
middle:6: setting <middle() return value>
middle:6: exiting block
middle:5: exiting block
middle:3: exiting block

We also want to be able to track calls across multiple functions. To this end, we wrap each call

func(arg1, arg2, ...)

into

_data.ret(_data.call(func)(_data.arg(arg1), _data.arg(arg2), ...))

each of which simply pass through their given argument, but which allow tracking the beginning of calls (call()), the computation of arguments (arg()), and the return of the call (ret()), respectively.

Tracking Calls and Arguments

Our TrackCallTransformer visits all Call nodes, applying the transformations as shown above.

class TrackCallTransformer(NodeTransformer):
    def make_call(self, node: AST, func: str, 
                  pos: Optional[int] = None, kw: Optional[str] = None) -> Call:
        """Return _data.call(`func`)(`node`)"""
        keywords = []

        # `Num()` and `Str()` are deprecated in favor of `Constant()`
        if pos:
            keywords.append(keyword(arg='pos', value=ast.Num(pos)))
        if kw:
            keywords.append(keyword(arg='kw', value=ast.Str(kw)))

        return Call(func=Attribute(value=Name(id=DATA_TRACKER,
                                              ctx=Load()),
                                   attr=func,
                                   ctx=Load()),
                     args=[node],
                     keywords=keywords)

    def visit_Call(self, node: Call) -> Call:
        self.generic_visit(node)

        call_as_text = ast.unparse(node)
        if call_as_text.startswith(DATA_TRACKER + '.ret'):
            return node  # Already applied

        func_as_text = ast.unparse(node)
        if func_as_text.startswith(DATA_TRACKER + '.'):
            return node  # Own function

        new_args = []
        for n, arg in enumerate(node.args):
            new_args.append(self.make_call(arg, 'arg', pos=n + 1))
        node.args = cast(List[ast.expr], new_args)

        for kw in node.keywords:
            id = kw.arg if hasattr(kw, 'arg') else None
            kw.value = self.make_call(kw.value, 'arg', kw=id)

        node.func = self.make_call(node.func, 'call')
        return self.make_call(node, 'ret')

Our example function middle() does not contain any calls, but here is a function that invokes middle() twice:

def test_call() -> int:
    x = middle(1, 2, z=middle(1, 2, 3))
    return x
call_tree = ast.parse(inspect.getsource(test_call))
dump_tree(call_tree)
def test_call() -> int:
    x = middle(1, 2, z=middle(1, 2, 3))
    return x

If we invoke TrackCallTransformer on this testing function, we get the following transformed code:

TrackCallTransformer().visit(call_tree);
dump_tree(call_tree)
def test_call() -> int:
    x = _data.ret(_data.call(middle)(_data.arg(1, pos=1), _data.arg(2, pos=2), z=_data.arg(_data.ret(_data.call(middle)(_data.arg(1, pos=1), _data.arg(2, pos=2), _data.arg(3, pos=3))), kw='z')))
    return x
def f() -> bool:
    return math.isclose(1, 1.0)
f_tree = ast.parse(inspect.getsource(f))
dump_tree(f_tree)
def f() -> bool:
    return math.isclose(1, 1.0)
TrackCallTransformer().visit(f_tree);
dump_tree(f_tree)
def f() -> bool:
    return _data.ret(_data.call(math.isclose)(_data.arg(1, pos=1), _data.arg(1.0, pos=2)))

As before, our default arg(), ret(), and call() methods simply log the event and pass through the given value.

class DataTracker(DataTracker):
    def arg(self, value: Any, pos: Optional[int] = None, kw: Optional[str] = None) -> Any:
        """
        Track `value` being passed as argument.
        `pos` (if given) is the argument position (starting with 1).
        `kw` (if given) is the argument keyword.
        """

        if self.log:
            caller_func, lineno = self.caller_location()
            info = ""
            if pos:
                info += f" #{pos}"
            if kw:
                info += f" {repr(kw)}"

            print(f"{caller_func.__name__}:{lineno}: pushing arg{info}")

        return value
class DataTracker(DataTracker):
    def ret(self, value: Any) -> Any:
        """Track `value` being used as return value."""
        if self.log:
            caller_func, lineno = self.caller_location()
            print(f"{caller_func.__name__}:{lineno}: returned from call")

        return value
class DataTracker(DataTracker):
    def instrument_call(self, func: Callable) -> Callable:
        """Instrument a call to `func`. To be implemented in subclasses."""
        return func

    def call(self, func: Callable) -> Callable:
        """Track a call to `func`."""
        if self.log:
            caller_func, lineno = self.caller_location()
            print(f"{caller_func.__name__}:{lineno}: calling {func}")

        return self.instrument_call(func)
dump_tree(call_tree)
def test_call() -> int:
    x = _data.ret(_data.call(middle)(_data.arg(1, pos=1), _data.arg(2, pos=2), z=_data.arg(_data.ret(_data.call(middle)(_data.arg(1, pos=1), _data.arg(2, pos=2), _data.arg(3, pos=3))), kw='z')))
    return x
with DataTrackerTester(call_tree, test_call):
    test_call()
test_call:2: calling <function middle at 0x10a9cbf40>
test_call:2: pushing arg #1
test_call:2: pushing arg #2
test_call:2: calling <function middle at 0x10a9cbf40>
test_call:2: pushing arg #1
test_call:2: pushing arg #2
test_call:2: pushing arg #3
test_call:2: returned from call
test_call:2: pushing arg 'z'
test_call:2: returned from call
test_call()
2

On the receiving end, for each function argument x, we insert a call _data.param('x', x, [position info]) to initialize x. This is useful for tracking parameters across function calls.

Tracking Parameters

Again, we use ast.dump() to determine the correct syntax tree:

print(ast.dump(ast.parse("_data.param('x', x, pos=1, last=True)")))
Module(body=[Expr(value=Call(func=Attribute(value=Name(id='_data', ctx=Load()), attr='param', ctx=Load()), args=[Constant(value='x'), Name(id='x', ctx=Load())], keywords=[keyword(arg='pos', value=Constant(value=1)), keyword(arg='last', value=Constant(value=True))]))], type_ignores=[])
class TrackParamsTransformer(NodeTransformer):
    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.FunctionDef:
        self.generic_visit(node)

        named_args = []
        for child in ast.iter_child_nodes(node.args):
            if isinstance(child, ast.arg):
                named_args.append(child)

        create_stmts = []
        for n, child in enumerate(named_args):
            keywords=[keyword(arg='pos', value=ast.Num(n=n + 1))]
            if child is node.args.vararg:
                keywords.append(keyword(arg='vararg', value=ast.Str(s='*')))
            if child is node.args.kwarg:
                keywords.append(keyword(arg='vararg', value=ast.Str(s='**')))
            if n == len(named_args) - 1:
                keywords.append(keyword(arg='last',
                                        value=ast.NameConstant(value=True)))

            create_stmt = Expr(
                value=Call(
                    func=Attribute(value=Name(id=DATA_TRACKER, ctx=Load()),
                                   attr='param', ctx=Load()),
                    args=[ast.Str(s=child.arg),
                          Name(id=child.arg, ctx=Load())
                         ],
                    keywords=keywords
                )
            )
            ast.copy_location(create_stmt, node)
            create_stmts.append(create_stmt)

        node.body = cast(List[ast.stmt], create_stmts) + node.body
        return node

This is the effect of TrackParamsTransformer(). You see how the first three parameters are all initialized.

TrackParamsTransformer().visit(middle_tree)
dump_tree(middle_tree)
def middle(x, y, z):
    _data.param('x', x, pos=1)
    _data.param('y', y, pos=2)
    _data.param('z', z, pos=3, last=True)
    if _data.test(_data.get('y', y) < _data.get('z', z)):
        with _data:
            if _data.test(_data.get('x', x) < _data.get('y', y)):
                with _data:
                    return _data.set('<middle() return value>', _data.get('y', y))
            else:
                with _data:
                    if _data.test(_data.get('x', x) < _data.get('z', z)):
                        with _data:
                            return _data.set('<middle() return value>', _data.get('y', y))
    else:
        with _data:
            if _data.test(_data.get('x', x) > _data.get('y', y)):
                with _data:
                    return _data.set('<middle() return value>', _data.get('y', y))
            else:
                with _data:
                    if _data.test(_data.get('x', x) > _data.get('z', z)):
                        with _data:
                            return _data.set('<middle() return value>', _data.get('x', x))
    return _data.set('<middle() return value>', _data.get('z', z))

By default, the DataTracker param() method simply calls set() to set variables.

class DataTracker(DataTracker):
    def param(self, name: str, value: Any, 
              pos: Optional[int] = None, vararg: str = '', last: bool = False) -> Any:
        """
        At the beginning of a function, track parameter `name` being set to `value`.
        `pos` is the position of the argument (starting with 1).
        `vararg` is "*" if `name` is a vararg parameter (as in *args),
        and "**" is `name` is a kwargs parameter (as in *kwargs).
        `last` is True if `name` is the last parameter.
        """
        if self.log:
            caller_func, lineno = self.caller_location()
            info = ""
            if pos is not None:
                info += f" #{pos}"

            print(f"{caller_func.__name__}:{lineno}: initializing {vararg}{name}{info}")

        return self.set(name, value)
with DataTrackerTester(middle_tree, middle):
    middle(2, 1, 3)
middle:1: initializing x #1
middle:1: setting x
middle:1: initializing y #2
middle:1: setting y
middle:1: initializing z #3
middle:1: setting z
middle:2: getting y
middle:2: getting z
middle:2: testing condition
middle:3: entering block
middle:3: getting x
middle:3: getting y
middle:3: testing condition
middle:5: entering block
middle:5: getting x
middle:5: getting z
middle:5: testing condition
middle:6: entering block
middle:6: getting y
middle:6: setting <middle() return value>
middle:6: exiting block
middle:5: exiting block
middle:3: exiting block
def args_test(x, *args, **kwargs):
    print(x, *args, **kwargs)
args_tree = ast.parse(inspect.getsource(args_test))
TrackParamsTransformer().visit(args_tree)
dump_tree(args_tree)
def args_test(x, *args, **kwargs):
    _data.param('x', x, pos=1)
    _data.param('args', args, pos=2, vararg='*')
    _data.param('kwargs', kwargs, pos=3, vararg='**', last=True)
    print(x, *args, **kwargs)
with DataTrackerTester(args_tree, args_test):
    args_test(1, 2, 3)
args_test:1: initializing x #1
args_test:1: setting x
args_test:1: initializing *args #2
args_test:1: setting args
args_test:1: initializing **kwargs #3
args_test:1: setting kwargs
1 2 3

What do we obtain after we have applied all these transformers on middle()? We see that the code now contains quite a load of instrumentation.

dump_tree(middle_tree)
def middle(x, y, z):
    _data.param('x', x, pos=1)
    _data.param('y', y, pos=2)
    _data.param('z', z, pos=3, last=True)
    if _data.test(_data.get('y', y) < _data.get('z', z)):
        with _data:
            if _data.test(_data.get('x', x) < _data.get('y', y)):
                with _data:
                    return _data.set('<middle() return value>', _data.get('y', y))
            else:
                with _data:
                    if _data.test(_data.get('x', x) < _data.get('z', z)):
                        with _data:
                            return _data.set('<middle() return value>', _data.get('y', y))
    else:
        with _data:
            if _data.test(_data.get('x', x) > _data.get('y', y)):
                with _data:
                    return _data.set('<middle() return value>', _data.get('y', y))
            else:
                with _data:
                    if _data.test(_data.get('x', x) > _data.get('z', z)):
                        with _data:
                            return _data.set('<middle() return value>', _data.get('x', x))
    return _data.set('<middle() return value>', _data.get('z', z))

And when we execute this code, we see that we can track quite a number of events, while the code semantics stay unchanged.

with DataTrackerTester(middle_tree, middle):
    m = middle(2, 1, 3)
m
middle:1: initializing x #1
middle:1: setting x
middle:1: initializing y #2
middle:1: setting y
middle:1: initializing z #3
middle:1: setting z
middle:2: getting y
middle:2: getting z
middle:2: testing condition
middle:3: entering block
middle:3: getting x
middle:3: getting y
middle:3: testing condition
middle:5: entering block
middle:5: getting x
middle:5: getting z
middle:5: testing condition
middle:6: entering block
middle:6: getting y
middle:6: setting <middle() return value>
middle:6: exiting block
middle:5: exiting block
middle:3: exiting block
1
Transformer Stress Test

We stress test our transformers by instrumenting, transforming, and compiling a number of modules.

import Assertions  # minor dependency
import Debugger  # minor dependency
for module in [Assertions, Debugger, inspect, ast]:
    module_tree = ast.parse(inspect.getsource(module))
    assert isinstance(module_tree, ast.Module)

    TrackCallTransformer().visit(module_tree)
    TrackSetTransformer().visit(module_tree)
    TrackGetTransformer().visit(module_tree)
    TrackControlTransformer().visit(module_tree)
    TrackReturnTransformer().visit(module_tree)
    TrackParamsTransformer().visit(module_tree)
    # dump_tree(module_tree)
    ast.fix_missing_locations(module_tree)  # Must run this before compiling

    module_code = compile(module_tree, '<stress_test>', 'exec')
    print(f"{repr(module.__name__)} instrumented successfully.")
'Assertions' instrumented successfully.
'Debugger' instrumented successfully.
'inspect' instrumented successfully.
'ast' instrumented successfully.

Our next step will now be not only to log these events, but to actually construct dependencies from them.

Tracking Dependencies

To construct dependencies from variable accesses, we subclass DataTracker into DependencyTracker – a class that actually keeps track of all these dependencies. Its constructor initializes a number of variables we will discuss below.

class DependencyTracker(DataTracker):
    """Track dependencies during execution"""

    def __init__(self, *args: Any, **kwargs: Any) -> None:
        """Constructor. Arguments are passed to DataTracker.__init__()"""
        super().__init__(*args, **kwargs)

        self.origins: Dict[str, Location] = {}  # Where current variables were last set
        self.data_dependencies: Dependency = {}  # As with Dependencies, above
        self.control_dependencies: Dependency = {}

        self.last_read: List[str] = []  # List of last read variables
        self.last_checked_location = (StackInspector.unknown, 1)
        self._ignore_location_change = False

        self.data: List[List[str]] = [[]]  # Data stack
        self.control: List[List[str]] = [[]]  # Control stack

        self.frames: List[Dict[Union[int, str], Any]] = [{}]  # Argument stack
        self.args: Dict[Union[int, str], Any] = {}  # Current args

Data Dependencies

The first job of our DependencyTracker is to construct dependencies between read and written variables.

Reading Variables

As in DataTracker, the key method of DependencyTracker again is get(), invoked as _data.get('x', x) whenever a variable x is read. First and foremost, it appends the name of the read variable to the list last_read.

class DependencyTracker(DependencyTracker):
    def get(self, name: str, value: Any) -> Any:
        """Track a read access for variable `name` with value `value`"""
        self.check_location()
        self.last_read.append(name)
        return super().get(name, value)

    def check_location(self) -> None:
        pass  # More on that below
x = 5
y = 3
_test_data = DependencyTracker(log=True)
_test_data.get('x', x) + _test_data.get('y', y)
<cell line: 2>:2: getting x
<cell line: 2>:2: getting y
8
_test_data.last_read
['x', 'y']

Checking Locations

However, before appending the read variable to last_read, _data.get() does one more thing. By invoking check_location(), it clears the last_read list if we have reached a new line in the execution. This avoids situations such as

x
y
z = a + b

where x and y are, well, read, but do not affect the last line. Therefore, with every new line, the list of last read lines is cleared.

class DependencyTracker(DependencyTracker):
    def clear_read(self) -> None:
        """Clear set of read variables"""
        if self.log:
            direct_caller = inspect.currentframe().f_back.f_code.co_name
            caller_func, lineno = self.caller_location()
            print(f"{caller_func.__name__}:{lineno}: "
                  f"clearing read variables {self.last_read} "
                  f"(from {direct_caller})")

        self.last_read = []

    def check_location(self) -> None:
        """If we are in a new location, clear set of read variables"""
        location = self.caller_location()
        func, lineno = location
        last_func, last_lineno = self.last_checked_location

        if self.last_checked_location != location:
            if self._ignore_location_change:
                self._ignore_location_change = False
            elif func.__name__.startswith('<'):
                # Entering list comprehension, eval(), exec(), ...
                pass
            elif last_func.__name__.startswith('<'):
                # Exiting list comprehension, eval(), exec(), ...
                pass
            else:
                # Standard case
                self.clear_read()

        self.last_checked_location = location

Two methods can suppress this reset of the last_read list:

  • ignore_next_location_change() suppresses the reset for the next line. This is useful when returning from a function, when the return value is still in the list of "read" variables.
  • ignore_location_change() suppresses the reset for the current line. This is useful if we already have returned from a function call.
class DependencyTracker(DependencyTracker):
    def ignore_next_location_change(self) -> None:
        self._ignore_location_change = True

    def ignore_location_change(self) -> None:
        self.last_checked_location = self.caller_location()

Watch how DependencyTracker resets last_read when a new line is executed:

_test_data = DependencyTracker()
_test_data.get('x', x) + _test_data.get('y', y)
8
_test_data.last_read
['x', 'y']
a = 42
b = -1
_test_data.get('a', a) + _test_data.get('b', b)
41
_test_data.last_read
['x', 'y', 'a', 'b']

Setting Variables

The method set() creates dependencies. It is invoked as _data.set('x', value) whenever a variable x is set.

First and foremost, it takes the list of variables read last_read, and for each of the variables $v$, it takes their origin $o$ (the place where they were last set) and appends the pair ($v$, $o$) to the list of data dependencies. It then does a similar thing with control dependencies (more on these below), and finally marks (in self.origins) the current location of $v$.

import itertools
class DependencyTracker(DependencyTracker):
    TEST = '<test>'  # Name of pseudo-variables for testing conditions

    def set(self, name: str, value: Any, loads: Optional[Set[str]] = None) -> Any:
        """Add a dependency for `name` = `value`"""

        def add_dependencies(dependencies: Set[Node], 
                             vars_read: List[str], tp: str) -> None:
            """Add origins of `vars_read` to `dependencies`."""
            for var_read in vars_read:
                if var_read in self.origins:
                    if var_read == self.TEST and tp == "data":
                        # Can't have data dependencies on conditions
                        continue

                    origin = self.origins[var_read]
                    dependencies.add((var_read, origin))

                    if self.log:
                        origin_func, origin_lineno = origin
                        caller_func, lineno = self.caller_location()
                        print(f"{caller_func.__name__}:{lineno}: "
                              f"new {tp} dependency: "
                              f"{name} <= {var_read} "
                              f"({origin_func.__name__}:{origin_lineno})")

        self.check_location()
        ret = super().set(name, value)
        location = self.caller_location()

        add_dependencies(self.data_dependencies.setdefault
                         ((name, location), set()),
                         self.last_read, tp="data")
        add_dependencies(self.control_dependencies.setdefault
                         ((name, location), set()),
                         cast(List[str], itertools.chain.from_iterable(self.control)),
                         tp="control")

        self.origins[name] = location

        # Reset read info for next line
        self.last_read = [name]

        # Next line is a new location
        self._ignore_location_change = False

        return ret

    def dependencies(self) -> Dependencies:
        """Return dependencies"""
        return Dependencies(self.data_dependencies,
                            self.control_dependencies)

Let us illustrate set() by example. Here's a set of variables read and written:

_test_data = DependencyTracker()
x = _test_data.set('x', 1)
y = _test_data.set('y', _test_data.get('x', x))
z = _test_data.set('z', _test_data.get('x', x) + _test_data.get('y', y))

The attribute origins saves for each variable where it was last written:

_test_data.origins
{'x': (<function __main__.<cell line: 2>()>, 2),
 'y': (<function __main__.<cell line: 3>()>, 3),
 'z': (<function __main__.<cell line: 4>()>, 4)}

The attribute data_dependencies tracks for each variable the variables it was read from:

_test_data.data_dependencies
{('x', (<function __main__.<cell line: 2>()>, 2)): set(),
 ('y',
  (<function __main__.<cell line: 3>()>, 3)): {('x',
   (<function __main__.<cell line: 2>()>, 2))},
 ('z',
  (<function __main__.<cell line: 4>()>, 4)): {('x',
   (<function __main__.<cell line: 2>()>, 2)), ('y',
   (<function __main__.<cell line: 3>()>, 3))}}

Hence, the above code already gives us a small dependency graph:

dependencies x_functioncellline_2at0x10c97bb50_2 x <cell line: 2> z_functioncellline_4at0x10ea488b0_4 z <cell line: 4> x_functioncellline_2at0x10c97bb50_2->z_functioncellline_4at0x10ea488b0_4 y_functioncellline_3at0x10ea48550_3 y <cell line: 3> x_functioncellline_2at0x10c97bb50_2->y_functioncellline_3at0x10ea48550_3 y_functioncellline_3at0x10ea48550_3->z_functioncellline_4at0x10ea488b0_4

In the remainder of this section, we define methods to

  • track control dependencies (test(), __enter__(), __exit__())
  • track function calls and returns (call(), ret())
  • track function arguments (arg(), param())
  • check the validity of our dependencies (validate()).

Like our get() and set() methods above, these work by refining the appropriate methods defined in the DataTracker class, building on our NodeTransformer transformations.

Control Dependencies

Let us detail control dependencies. As discussed with DataTracker(), we invoke test() methods for all control conditions, and place the controlled blocks into with clauses.

The test() method simply sets a <test> variable; this also places it in last_read.

class DependencyTracker(DependencyTracker):
    def test(self, value: Any) -> Any:
        """Track a test for condition `value`"""
        self.set(self.TEST, value)
        return super().test(value)

When entering a with block, the set of last_read variables holds the <test> variable read. We save it in the control stack, with the effect of any further variables written now being marked as controlled by <test>.

class DependencyTracker(DependencyTracker):
    def __enter__(self) -> Any:
        """Track entering an if/while/for block"""
        self.control.append(self.last_read)
        self.clear_read()
        return super().__enter__()

When we exit the with block, we restore earlier last_read values, preparing for else blocks.

class DependencyTracker(DependencyTracker):
    def __exit__(self, exc_type: Type, exc_value: BaseException,
                 traceback: TracebackType) -> Optional[bool]:
        """Track exiting an if/while/for block"""
        self.clear_read()
        self.last_read = self.control.pop()
        self.ignore_next_location_change()
        return super().__exit__(exc_type, exc_value, traceback)

Here's an example of all these parts in action:

_test_data = DependencyTracker()
x = _test_data.set('x', 1)
y = _test_data.set('y', _test_data.get('x', x))
if _test_data.test(_test_data.get('x', x) >= _test_data.get('y', y)):
    with _test_data:
        z = _test_data.set('z',
                           _test_data.get('x', x) + _test_data.get('y', y))
_test_data.control_dependencies
{('x', (<function __main__.<cell line: 2>()>, 2)): set(),
 ('y', (<function __main__.<cell line: 3>()>, 3)): set(),
 ('<test>', (<function __main__.<cell line: 1>()>, 1)): set(),
 ('z',
  (<function __main__.<cell line: 1>()>, 3)): {('<test>',
   (<function __main__.<cell line: 1>()>, 1))}}

The control dependency for z is reflected in the dependency graph:

dependencies x_functioncellline_2at0x10c97bb50_2 x <cell line: 2> z_functioncellline_1at0x10ea48ee0_3 z <cell line: 1> x_functioncellline_2at0x10c97bb50_2->z_functioncellline_1at0x10ea48ee0_3 y_functioncellline_3at0x10ea48550_3 y <cell line: 3> x_functioncellline_2at0x10c97bb50_2->y_functioncellline_3at0x10ea48550_3 test_functioncellline_1at0x10c97b640_1 <test> _test_data.get('x', x) x_functioncellline_2at0x10c97bb50_2->test_functioncellline_1at0x10c97b640_1 y_functioncellline_3at0x10ea48550_3->z_functioncellline_1at0x10ea48ee0_3 y_functioncellline_3at0x10ea48550_3->test_functioncellline_1at0x10c97b640_1 test_functioncellline_1at0x10c97b640_1->z_functioncellline_1at0x10ea48ee0_3
Calls and Returns
import copy

To handle complex expressions involving functions, we introduce a data stack. Every time we invoke a function func (call() is invoked), we save the list of current variables read last_read on the data stack; when we return (ret() is invoked), we restore last_read. This also ensures that only those variables read while evaluating arguments will flow into the function call.

class DependencyTracker(DependencyTracker):
    def call(self, func: Callable) -> Callable:
        """Track a call of function `func`"""
        func = super().call(func)

        if inspect.isgeneratorfunction(func):
            return self.call_generator(func)

        # Save context
        if self.log:
            caller_func, lineno = self.caller_location()
            print(f"{caller_func.__name__}:{lineno}: "
                  f"saving read variables {self.last_read}")

        self.data.append(self.last_read)
        self.clear_read()
        self.ignore_next_location_change()

        self.frames.append(self.args)
        self.args = {}

        return func
class DependencyTracker(DependencyTracker):
    def ret(self, value: Any) -> Any:
        """Track a function return"""
        value = super().ret(value)

        if self.in_generator():
            return self.ret_generator(value)

        # Restore old context and add return value
        ret_name = None
        for var in self.last_read:
            if var.startswith("<"):  # "<return value>"
                ret_name = var

        self.last_read = self.data.pop()
        if ret_name:
            self.last_read.append(ret_name)

        if self.args:
            # We return from an uninstrumented function:
            # Make return value depend on all args
            for key, deps in self.args.items():
                self.last_read += deps

        self.ignore_location_change()

        self.args = self.frames.pop()

        if self.log:
            caller_func, lineno = self.caller_location()
            print(f"{caller_func.__name__}:{lineno}: "
                  f"restored read variables {self.last_read}")

        return value

Generator functions (those which yield a value) are not "called" in the sense that Python transfers control to them; instead, a "call" to a generator function creates a generator that is evaluated on demand. We mark generator function "calls" by saving None on the stacks. When the generator function returns the generator, we wrap the generator such that the arguments are being restored when it is invoked.

class DependencyTracker(DependencyTracker):
    def in_generator(self) -> bool:
        """True if we are calling a generator function"""
        return len(self.data) > 0 and self.data[-1] is None

    def call_generator(self, func: Callable) -> Callable:
        """Track a call of a generator function"""
        # Mark the fact that we're in a generator with `None` values
        self.data.append(None)
        self.frames.append(None)
        assert self.in_generator()

        self.clear_read()
        return func

    def ret_generator(self, generator: Any) -> Any:
        """Track the return of a generator function"""
        # Pop the two 'None' values pushed earlier
        self.data.pop()
        self.frames.pop()

        if self.log:
            caller_func, lineno = self.caller_location()
            print(f"{caller_func.__name__}:{lineno}: "
                  f"wrapping generator {generator} (args={self.args})")

        # At this point, we already have collected the args.
        # The returned generator depends on all of them.
        for arg in self.args:
            self.last_read += self.args[arg]

        # Wrap the generator such that the args are restored 
        # when it is actually invoked, such that we can map them
        # to parameters.
        saved_args = copy.deepcopy(self.args)

        def wrapper() -> Generator[Any, None, None]:
            self.args = saved_args
            if self.log:
                caller_func, lineno = self.caller_location()
                print(f"{caller_func.__name__}:{lineno}: "
                  f"calling generator (args={self.args})")

            self.ignore_next_location_change()
            yield from generator

        return wrapper()

We see an example of how function calls and returns work in conjunction with function arguments, discussed in the next section.

Function Arguments

Finally, we handle parameters and arguments. The args stack holds the current stack of function arguments, holding the last_read variable for each argument.

class DependencyTracker(DependencyTracker):
    def arg(self, value: Any, pos: Optional[int] = None, kw: Optional[str] = None) -> Any:
        """
        Track passing an argument `value`
        (with given position `pos` 1..n or keyword `kw`)
        """
        if self.log:
            caller_func, lineno = self.caller_location()
            print(f"{caller_func.__name__}:{lineno}: "
                  f"saving args read {self.last_read}")

        if pos:
            self.args[pos] = self.last_read
        if kw:
            self.args[kw] = self.last_read

        self.clear_read()
        return super().arg(value, pos, kw)

When accessing the arguments (with param()), we can retrieve this set of read variables for each argument.

class DependencyTracker(DependencyTracker):
    def param(self, name: str, value: Any,
              pos: Optional[int] = None, vararg: str = "", last: bool = False) -> Any:
        """
        Track getting a parameter `name` with value `value`
        (with given position `pos`).
        vararg parameters are indicated by setting `varargs` to 
        '*' (*args) or '**' (**kwargs)
        """
        self.clear_read()

        if vararg == '*':
            # We over-approximate by setting `args` to _all_ positional args
            for index in self.args:
                if isinstance(index, int) and pos is not None and index >= pos:
                    self.last_read += self.args[index]
        elif vararg == '**':
            # We over-approximate by setting `kwargs` to _all_ passed keyword args
            for index in self.args:
                if isinstance(index, str):
                    self.last_read += self.args[index]
        elif name in self.args:
            self.last_read = self.args[name]
        elif pos in self.args:
            self.last_read = self.args[pos]

        if self.log:
            caller_func, lineno = self.caller_location()
            print(f"{caller_func.__name__}:{lineno}: "
                  f"restored params read {self.last_read}")

        self.ignore_location_change()
        ret = super().param(name, value, pos)

        if last:
            self.clear_read()
            self.args = {}  # Mark `args` as processed

        return ret

Let us illustrate all these on a small example.

def call_test() -> int:
    c = 47

    def sq(n: int) -> int:
        return n * n

    def gen(e: int) -> Generator[int, None, None]:
        yield e * c

    def just_x(x: Any, y: Any) -> Any:
        return x

    a = 42
    b = gen(a)
    d = list(b)[0]

    xs = [1, 2, 3, 4]
    ys = [sq(elem) for elem in xs if elem > 2]

    return just_x(just_x(d, y=b), ys[0])
call_test()
1974

We apply all our transformers on this code:

call_tree = ast.parse(inspect.getsource(call_test))
TrackCallTransformer().visit(call_tree)
TrackSetTransformer().visit(call_tree)
TrackGetTransformer().visit(call_tree)
TrackControlTransformer().visit(call_tree)
TrackReturnTransformer().visit(call_tree)
TrackParamsTransformer().visit(call_tree)
dump_tree(call_tree)
def call_test() -> int:
    c = _data.set('c', 47)

    def sq(n: int) -> int:
        _data.param('n', n, pos=1, last=True)
        return _data.set('<sq() return value>', _data.get('n', n) * _data.get('n', n))

    def gen(e: int) -> Generator[int, None, None]:
        _data.param('e', e, pos=1, last=True)
        yield _data.set('<gen() yield value>', _data.get('e', e) * _data.get('c', c))

    def just_x(x: Any, y: Any) -> Any:
        _data.param('x', x, pos=1)
        _data.param('y', y, pos=2, last=True)
        return _data.set('<just_x() return value>', _data.get('x', x))
    a = _data.set('a', 42)
    b = _data.set('b', _data.ret(_data.call(_data.get('gen', gen))(_data.arg(_data.get('a', a), pos=1))))
    d = _data.set('d', _data.ret(_data.call(list)(_data.arg(_data.get('b', b), pos=1)))[0])
    xs = _data.set('xs', [1, 2, 3, 4])
    ys = _data.set('ys', [_data.ret(_data.call(_data.get('sq', sq))(_data.arg(_data.get('elem', elem), pos=1))) for elem in _data.set('elem', _data.get('xs', xs)) if _data.get('elem', elem) > 2])
    return _data.set('<call_test() return value>', _data.ret(_data.call(_data.get('just_x', just_x))(_data.arg(_data.ret(_data.call(_data.get('just_x', just_x))(_data.arg(_data.get('d', d), pos=1), y=_data.arg(_data.get('b', b), kw='y'))), pos=1), _data.arg(_data.get('ys', ys)[0], pos=2))))

Again, we capture the dependencies:

class DependencyTrackerTester(DataTrackerTester):
    def make_data_tracker(self) -> DependencyTracker:
        return DependencyTracker(log=self.log)
with DependencyTrackerTester(call_tree, call_test, log=False) as call_deps:
    call_test()

We see how

  • a flows into the generator b and into the parameter e of gen().
  • xs flows into elem which in turn flows into the parameter n of sq(). Both flow into ys.
  • just_x() returns only its x argument.
call_deps.dependencies()
dependencies xs_functioncall_testat0x10ea49750_17 xs xs = [1, 2, 3, 4] elem_functioncall_testat0x10ea49750_18 elem ys = [sq(elem) for elem in xs if elem > 2] xs_functioncall_testat0x10ea49750_17->elem_functioncall_testat0x10ea49750_18 sqreturnvalue_functioncall_testlocalssqat0x10ea49870_5 <sq() return value> return n * n n_functioncall_testlocalssqat0x10ea49870_4 n def sq(n: int) -> int: n_functioncall_testlocalssqat0x10ea49870_4->sqreturnvalue_functioncall_testlocalssqat0x10ea49870_5 ys_functioncall_testat0x10ea49750_18 ys ys = [sq(elem) for elem in xs if elem > 2] y_functioncall_testlocalsjust_xat0x10ea49990_10 y def just_x(x: Any, y: Any) -> Any: ys_functioncall_testat0x10ea49750_18->y_functioncall_testlocalsjust_xat0x10ea49990_10 call_testreturnvalue_functioncall_testat0x10ea49750_20 <call_test() return value> return just_x(just_x(d, y=b), ys[0]) e_functioncall_testlocalsgenat0x10ea49900_7 e def gen(e: int) -> Generator[int, None, None]: genyieldvalue_functioncall_testlocalsgenat0x10ea49900_8 <gen() yield value> yield e * c e_functioncall_testlocalsgenat0x10ea49900_7->genyieldvalue_functioncall_testlocalsgenat0x10ea49900_8 a_functioncall_testat0x10ea49750_13 a a = 42 a_functioncall_testat0x10ea49750_13->e_functioncall_testlocalsgenat0x10ea49900_7 b_functioncall_testat0x10ea49750_14 b b = gen(a) a_functioncall_testat0x10ea49750_13->b_functioncall_testat0x10ea49750_14 d_functioncall_testat0x10ea49750_15 d d = list(b)[0] x_functioncall_testlocalsjust_xat0x10ea49990_10 x def just_x(x: Any, y: Any) -> Any: d_functioncall_testat0x10ea49750_15->x_functioncall_testlocalsjust_xat0x10ea49990_10 genyieldvalue_functioncall_testlocalsgenat0x10ea49900_8->d_functioncall_testat0x10ea49750_15 b_functioncall_testat0x10ea49750_14->d_functioncall_testat0x10ea49750_15 b_functioncall_testat0x10ea49750_14->y_functioncall_testlocalsjust_xat0x10ea49990_10 elem_functioncall_testat0x10ea49750_18->n_functioncall_testlocalssqat0x10ea49870_4 just_xreturnvalue_functioncall_testlocalsjust_xat0x10ea49990_11 <just_x() return value> return x just_xreturnvalue_functioncall_testlocalsjust_xat0x10ea49990_11->x_functioncall_testlocalsjust_xat0x10ea49990_10 just_xreturnvalue_functioncall_testlocalsjust_xat0x10ea49990_11->call_testreturnvalue_functioncall_testat0x10ea49750_20 x_functioncall_testlocalsjust_xat0x10ea49990_10->just_xreturnvalue_functioncall_testlocalsjust_xat0x10ea49990_11 c_functioncall_testat0x10ea49750_2 c c = 47 c_functioncall_testat0x10ea49750_2->genyieldvalue_functioncall_testlocalsgenat0x10ea49900_8

The code() view lists each function separately:

call_deps.dependencies().code()
     1 def call_test() -> int:
*    2     c = 47
     3 
     4     def sq(n: int) -> int:
     5         return n * n
     6 
     7     def gen(e: int) -> Generator[int, None, None]:
     8         yield e * c
     9 
    10     def just_x(x: Any, y: Any) -> Any:
    11         return x
    12 
*   13     a = 42
*   14     b = gen(a)  # <= a (13)
*   15     d = list(b)[0]  # <= <gen() yield value> (gen:8), b (14)
    16 
*   17     xs = [1, 2, 3, 4]
*   18     ys = [sq(elem) for elem in xs if elem > 2]  # <= xs (17)
    19 
*   20     return just_x(just_x(d, y=b), ys[0])  # <= <just_x() return value> (just_x:11)

*    4     def sq(n: int) -> int:  # <= elem (call_test:18)
*    5         return n * n  # <= n (4)

*    7     def gen(e: int) -> Generator[int, None, None]:  # <= a (call_test:13)
*    8         yield e * c  # <= e (7), c (call_test:2)

*   10     def just_x(x: Any, y: Any) -> Any:  # <= d (call_test:15), <just_x() return value> (11), ys (call_test:18), b (call_test:14)
*   11         return x  # <= x (10)
Diagnostics

To check the dependencies we obtain, we perform some minimal checks on whether a referenced variable actually also occurs in the source code.

import re

validate() is automatically called whenever dependencies are output, so if you see any of its error messages, something may be wrong.

At this point, DependencyTracker is complete; we have all in place to track even complex dependencies in instrumented code.

Slicing Code

Let us now put all these pieces together. We have a means to instrument the source code (our various NodeTransformer classes) and a means to track dependencies (the DependencyTracker class). Now comes the time to put all these things together in a single tool, which we call Slicer.

The basic idea of Slicer is that you can use it as follows:

with Slicer(func_1, func_2, ...) as slicer:
    func(...)

which first instruments the functions given in the constructor (i.e., replaces their definitions with instrumented counterparts), and then runs the code in the body, calling instrumented functions, and allowing the slicer to collect dependencies. When the body returns, the original definition of the instrumented functions is restored.

An Instrumenter Base Class

The basic functionality of instrumenting a number of functions (and restoring them at the end of the with block) comes in a Instrumenter base class. It invokes instrument() on all items to instrument; this is to be overloaded in subclasses.

class Instrumenter(StackInspector):
    """Instrument functions for dynamic tracking"""

    def __init__(self, *items_to_instrument: Callable,
                 globals: Optional[Dict[str, Any]] = None,
                 log: Union[bool, int] = False) -> None:
        """
        Create an instrumenter.
        `items_to_instrument` is a list of items to instrument.
        `globals` is a namespace to use (default: caller's globals())
        """

        self.log = log
        self.items_to_instrument: List[Callable] = list(items_to_instrument)
        self.instrumented_items: Set[Any] = set()

        if globals is None:
            globals = self.caller_globals()
        self.globals = globals

    def __enter__(self) -> Any:
        """Instrument sources"""
        items = self.items_to_instrument
        if not items:
            items = self.default_items_to_instrument()

        for item in items:
            self.instrument(item)

        return self

    def default_items_to_instrument(self) -> List[Callable]:
        return []

    def instrument(self, item: Any) -> Any:
        """Instrument `item`. To be overloaded in subclasses."""
        if self.log:
            print("Instrumenting", item)
        self.instrumented_items.add(item)
        return item

At the end of the with block, we restore the given functions.

class Instrumenter(Instrumenter):
    def __exit__(self, exc_type: Type, exc_value: BaseException,
                 traceback: TracebackType) -> Optional[bool]:
        """Restore sources"""
        self.restore()
        return None

    def restore(self) -> None:
        for item in self.instrumented_items:
            self.globals[item.__name__] = item

By default, an Instrumenter simply outputs a log message:

with Instrumenter(middle, log=True) as ins:
    pass
Instrumenting <function middle at 0x10a9cbf40>

The Slicer Class

The Slicer class comes as a subclass of Instrumenter. It sets its own dependency tracker (which can be overwritten by setting the dependency_tracker keyword argument).

class Slicer(Instrumenter):
    """Track dependencies in an execution"""

    def __init__(self, *items_to_instrument: Any,
                 dependency_tracker: Optional[DependencyTracker] = None,
                 globals: Optional[Dict[str, Any]] = None,
                 log: Union[bool, int] = False):
        """Create a slicer.
        `items_to_instrument` are Python functions or modules with source code.
        `dependency_tracker` is the tracker to be used (default: DependencyTracker).
        `globals` is the namespace to be used(default: caller's `globals()`)
        `log`=True or `log` > 0 turns on logging
        """
        super().__init__(*items_to_instrument, globals=globals, log=log)

        if dependency_tracker is None:
            dependency_tracker = DependencyTracker(log=(log > 1))
        self.dependency_tracker = dependency_tracker

        self.saved_dependencies = None

    def default_items_to_instrument(self) -> List[Callable]:
        raise ValueError("Need one or more items to instrument")

The parse() method parses a given item, returning its AST.

class Slicer(Slicer):
    def parse(self, item: Any) -> AST:
        """Parse `item`, returning its AST"""
        source_lines, lineno = inspect.getsourcelines(item)
        source = "".join(source_lines)

        if self.log >= 2:
            print_content(source, '.py', start_line_number=lineno)
            print()
            print()

        tree = ast.parse(source)
        ast.increment_lineno(tree, lineno - 1)
        return tree

The transform() method applies the list of transformers defined earlier in this chapter.

class Slicer(Slicer):
    def transformers(self) -> List[NodeTransformer]:
        """List of transformers to apply. To be extended in subclasses."""
        return [
            TrackCallTransformer(),
            TrackSetTransformer(),
            TrackGetTransformer(),
            TrackControlTransformer(),
            TrackReturnTransformer(),
            TrackParamsTransformer()
        ]

    def transform(self, tree: AST) -> AST:
        """Apply transformers on `tree`. May be extended in subclasses."""
        # Apply transformers
        for transformer in self.transformers():
            if self.log >= 3:
                print(transformer.__class__.__name__ + ':')

            transformer.visit(tree)
            ast.fix_missing_locations(tree)
            if self.log >= 3:
                print_content(ast.unparse(tree), '.py')
                print()
                print()

        if 0 < self.log < 3:
            print_content(ast.unparse(tree), '.py')
            print()
            print()

        return tree

The execute() method executes the transformed tree (such that we get the new definitions). We also make the dependency tracker available for the code in the with block.

class Slicer(Slicer):
    def execute(self, tree: AST, item: Any) -> None:
        """Compile and execute `tree`. May be extended in subclasses."""

        # We pass the source file of `item` such that we can retrieve it
        # when accessing the location of the new compiled code
        source = cast(str, inspect.getsourcefile(item))
        code = compile(cast(ast.Module, tree), source, 'exec')

        # Enable dependency tracker
        self.globals[DATA_TRACKER] = self.dependency_tracker

        # Execute the code, resulting in a redefinition of item
        exec(code, self.globals)

The instrument() method puts all these together, first parsing the item into a tree, then transforming and executing the tree.

class Slicer(Slicer):
    def instrument(self, item: Any) -> Any:
        """Instrument `item`, transforming its source code, and re-defining it."""
        if is_internal(item.__name__):
            return item  # Do not instrument `print()` and the like

        if inspect.isbuiltin(item):
            return item  # No source code

        item = super().instrument(item)
        tree = self.parse(item)
        tree = self.transform(tree)
        self.execute(tree, item)

        new_item = self.globals[item.__name__]
        return new_item

When we restore the original definition (after the with block), we save the dependency tracker again.

class Slicer(Slicer):
    def restore(self) -> None:
        """Restore original code."""
        if DATA_TRACKER in self.globals:
            self.saved_dependencies = self.globals[DATA_TRACKER]
            del self.globals[DATA_TRACKER]

        super().restore()

Three convenience functions allow us to see the dependencies as (well) dependencies, as code, and as graph. These simply invoke the respective functions on the saved dependencies.

class Slicer(Slicer):
    def dependencies(self) -> Dependencies:
        """Return collected dependencies."""
        if self.saved_dependencies is None:
            return Dependencies({}, {})
        return self.saved_dependencies.dependencies()

    def code(self, *args: Any, **kwargs: Any) -> None:
        """Show code of instrumented items, annotated with dependencies."""
        first = True
        for item in self.instrumented_items:
            if not first:
                print()
            self.dependencies().code(item, *args, **kwargs)
            first = False

    def graph(self, *args: Any, **kwargs: Any) -> Digraph:
        """Show dependency graph."""
        return self.dependencies().graph(*args, **kwargs)

    def _repr_mimebundle_(self, include: Any = None, exclude: Any = None) -> Any:
        """If the object is output in Jupyter, render dependencies as a SVG graph"""
        return self.graph()._repr_mimebundle_(include, exclude)

Let us put Slicer into action. We track our middle() function:

with Slicer(middle) as slicer:
    m = middle(2, 1, 3)
m
1

These are the dependencies in string form (used when printed):

print(slicer.dependencies())
middle():
    <test> (2) <= z (1), y (1)
    <test> (3) <= x (1), y (1); <- <test> (2)
    <test> (5) <= z (1), x (1); <- <test> (3)
    <middle() return value> (6) <= y (1); <- <test> (5)

This is the code form:

slicer.code()
*    1 def middle(x, y, z):  # type: ignore
*    2     if y < z:  # <= z (1), y (1)
*    3         if x < y:  # <= x (1), y (1); <- <test> (2)
     4             return y
*    5         elif x < z:  # <= z (1), x (1); <- <test> (3)
*    6             return y  # <= y (1); <- <test> (5)
     7     else:
     8         if x > y:
     9             return y
    10         elif x > z:
    11             return x
    12     return z

And this is the graph form:

slicer
dependencies middlereturnvalue_functionmiddleat0x10ea4b880_6 <middle() return value> return y y_functionmiddleat0x10ea4b880_1 y def middle(x, y, z):  # type: ignore y_functionmiddleat0x10ea4b880_1->middlereturnvalue_functionmiddleat0x10ea4b880_6 test_functionmiddleat0x10ea4b880_3 <test> if x < y: y_functionmiddleat0x10ea4b880_1->test_functionmiddleat0x10ea4b880_3 test_functionmiddleat0x10ea4b880_2 <test> if y < z: y_functionmiddleat0x10ea4b880_1->test_functionmiddleat0x10ea4b880_2 test_functionmiddleat0x10ea4b880_5 <test> elif x < z: test_functionmiddleat0x10ea4b880_5->middlereturnvalue_functionmiddleat0x10ea4b880_6 z_functionmiddleat0x10ea4b880_1 z def middle(x, y, z):  # type: ignore z_functionmiddleat0x10ea4b880_1->test_functionmiddleat0x10ea4b880_5 z_functionmiddleat0x10ea4b880_1->test_functionmiddleat0x10ea4b880_2 x_functionmiddleat0x10ea4b880_1 x def middle(x, y, z):  # type: ignore x_functionmiddleat0x10ea4b880_1->test_functionmiddleat0x10ea4b880_5 x_functionmiddleat0x10ea4b880_1->test_functionmiddleat0x10ea4b880_3 test_functionmiddleat0x10ea4b880_3->test_functionmiddleat0x10ea4b880_5 test_functionmiddleat0x10ea4b880_2->test_functionmiddleat0x10ea4b880_3

You can also access the raw repr() form, which allows you to reconstruct dependencies at any time. (This is how we showed off dependencies at the beginning of this chapter, before even introducing the code that computes them.)

print(repr(slicer.dependencies()))
Dependencies(
    data={
        ('x', (middle, 1)): set(),
        ('y', (middle, 1)): set(),
        ('z', (middle, 1)): set(),
        ('<test>', (middle, 2)): {('z', (middle, 1)), ('y', (middle, 1))},
        ('<test>', (middle, 3)): {('x', (middle, 1)), ('y', (middle, 1))},
        ('<test>', (middle, 5)): {('z', (middle, 1)), ('x', (middle, 1))},
        ('<middle() return value>', (middle, 6)): {('y', (middle, 1))}},
 control={
        ('x', (middle, 1)): set(),
        ('y', (middle, 1)): set(),
        ('z', (middle, 1)): set(),
        ('<test>', (middle, 2)): set(),
        ('<test>', (middle, 3)): {('<test>', (middle, 2))},
        ('<test>', (middle, 5)): {('<test>', (middle, 3))},
        ('<middle() return value>', (middle, 6)): {('<test>', (middle, 5))}})

Diagnostics

The Slicer constructor accepts a log argument (default: False), which can be set to show various intermediate results:

  • log=True (or log=1): Show instrumented source code
  • log=2: Also log execution
  • log=3: Also log individual transformer steps
  • log=4: Also log source line numbers

More Examples

Let us demonstrate our Slicer class on a few more examples.

Square Root

The square_root() function from the chapter on assertions demonstrates a nice interplay between data and control dependencies.

import math
from Assertions import square_root  # minor dependency

Here is the original source code:

print_content(inspect.getsource(square_root), '.py')
def square_root(x):  # type: ignore
    assert x >= 0  # precondition

    approx = None
    guess = x / 2
    while approx != guess:
        approx = guess
        guess = (approx + x / approx) / 2

    assert math.isclose(approx * approx, x)
    return approx

Turning on logging shows the instrumented version:

with Slicer(square_root, log=True) as root_slicer:
    y = square_root(2.0)
Instrumenting <function square_root at 0x10cadfac0>
def square_root(x):
    _data.param('x', x, pos=1, last=True)
    assert _data.set('<assertion>', _data.get('x', x) >= 0, loads=(_data.get('x', x),))
    approx = _data.set('approx', None)
    guess = _data.set('guess', _data.get('x', x) / 2)
    while _data.test(_data.get('approx', approx) != _data.get('guess', guess)):
        with _data:
            approx = _data.set('approx', _data.get('guess', guess))
            guess = _data.set('guess', (_data.get('approx', approx) + _data.get('x', x) / _data.get('approx', approx)) / 2)
    assert _data.set('<assertion>', _data.ret(_data.call(_data.get('math', math).isclose)(_data.arg(_data.get('approx', approx) * _data.get('approx', approx), pos=1), _data.arg(_data.get('x', x), pos=2))), loads=(_data.get('approx', approx), _data.get('math', math), _data.get('x', x), _data))
    return _data.set('<square_root() return value>', _data.get('approx', approx))

The dependency graph shows how guess and approx flow into each other until they are the same.

root_slicer
dependencies square_rootreturnvalue_functionsquare_rootat0x10ea492d0_64 <square_root() return value> return approx approx_functionsquare_rootat0x10ea492d0_60 approx approx = guess approx_functionsquare_rootat0x10ea492d0_60->square_rootreturnvalue_functionsquare_rootat0x10ea492d0_64 guess_functionsquare_rootat0x10ea492d0_61 guess guess = (approx + x / approx) / 2 approx_functionsquare_rootat0x10ea492d0_60->guess_functionsquare_rootat0x10ea492d0_61 test_functionsquare_rootat0x10ea492d0_59 <test> while approx != guess: approx_functionsquare_rootat0x10ea492d0_60->test_functionsquare_rootat0x10ea492d0_59 assertion_functionsquare_rootat0x10ea492d0_63 <assertion> assert math.isclose(approx * approx, x) approx_functionsquare_rootat0x10ea492d0_60->assertion_functionsquare_rootat0x10ea492d0_63 x_functionsquare_rootat0x10ea492d0_54 x def square_root(x):  # type: ignore guess_functionsquare_rootat0x10ea492d0_58 guess guess = x / 2 x_functionsquare_rootat0x10ea492d0_54->guess_functionsquare_rootat0x10ea492d0_58 assertion_functionsquare_rootat0x10ea492d0_55 <assertion> assert x >= 0  # precondition x_functionsquare_rootat0x10ea492d0_54->assertion_functionsquare_rootat0x10ea492d0_55 x_functionsquare_rootat0x10ea492d0_54->guess_functionsquare_rootat0x10ea492d0_61 x_functionsquare_rootat0x10ea492d0_54->assertion_functionsquare_rootat0x10ea492d0_63 guess_functionsquare_rootat0x10ea492d0_58->approx_functionsquare_rootat0x10ea492d0_60 guess_functionsquare_rootat0x10ea492d0_58->test_functionsquare_rootat0x10ea492d0_59 approx_functionsquare_rootat0x10ea492d0_57 approx approx = None guess_functionsquare_rootat0x10ea492d0_61->approx_functionsquare_rootat0x10ea492d0_60 guess_functionsquare_rootat0x10ea492d0_61->test_functionsquare_rootat0x10ea492d0_59 test_functionsquare_rootat0x10ea492d0_59->approx_functionsquare_rootat0x10ea492d0_60 test_functionsquare_rootat0x10ea492d0_59->guess_functionsquare_rootat0x10ea492d0_61 approx_functionsquare_rootat0x10ea492d0_57->test_functionsquare_rootat0x10ea492d0_59

Again, we can show the code annotated with dependencies:

root_slicer.code()
*   54 def square_root(x):  # type: ignore
*   55     assert x >= 0  # precondition  # <= x (54)
    56 
*   57     approx = None
*   58     guess = x / 2  # <= x (54)
*   59     while approx != guess:  # <= guess (61), guess (58), approx (60), approx (57)
*   60         approx = guess  # <= guess (58), guess (61); <- <test> (59)
*   61         guess = (approx + x / approx) / 2  # <= approx (60), x (54); <- <test> (59)
    62 
*   63     assert math.isclose(approx * approx, x)  # <= approx (60), x (54)
*   64     return approx  # <= approx (60)

The astute reader may find that a statement assert p does not control the following code, although it would be equivalent to if not p: raise Exception. Why is that?

Quiz

Why don't assert statements induce control dependencies?





Indeed: we treat assertions as "neutral" in the sense that they do not affect the remainder of the program – if they are turned off, they have no effect; and if they are turned on, the remaining program logic should not depend on them. (Our instrumentation also has no special treatment of raise or even return statements; they should be handled by our with blocks, though.)

# print(repr(root_slicer))

Removing HTML Markup

Let us come to our ongoing example, remove_html_markup(). This is how its instrumented code looks like:

with Slicer(remove_html_markup) as rhm_slicer:
    s = remove_html_markup("<foo>bar</foo>")

The graph is as discussed in the introduction to this chapter:

rhm_slicer
dependencies assertion_functionremove_html_markupat0x10ea4b910_244 <assertion> assert tag or not quote test_functionremove_html_markupat0x10ea4b910_246 <test> if c == '<' and not quote: tag_functionremove_html_markupat0x10ea4b910_249 tag tag = False tag_functionremove_html_markupat0x10ea4b910_249->assertion_functionremove_html_markupat0x10ea4b910_244 test_functionremove_html_markupat0x10ea4b910_250 <test> elif (c == '"' or c == "'") and tag: test_functionremove_html_markupat0x10ea4b910_252 <test> elif not tag: tag_functionremove_html_markupat0x10ea4b910_249->test_functionremove_html_markupat0x10ea4b910_252 quote_functionremove_html_markupat0x10ea4b910_240 quote quote = False quote_functionremove_html_markupat0x10ea4b910_240->assertion_functionremove_html_markupat0x10ea4b910_244 test_functionremove_html_markupat0x10ea4b910_248 <test> elif c == '>' and not quote: quote_functionremove_html_markupat0x10ea4b910_240->test_functionremove_html_markupat0x10ea4b910_248 quote_functionremove_html_markupat0x10ea4b910_240->test_functionremove_html_markupat0x10ea4b910_246 out_functionremove_html_markupat0x10ea4b910_241 out out = "" tag_functionremove_html_markupat0x10ea4b910_239 tag tag = False tag_functionremove_html_markupat0x10ea4b910_239->assertion_functionremove_html_markupat0x10ea4b910_244 tag_functionremove_html_markupat0x10ea4b910_247 tag tag = True tag_functionremove_html_markupat0x10ea4b910_247->assertion_functionremove_html_markupat0x10ea4b910_244 tag_functionremove_html_markupat0x10ea4b910_247->test_functionremove_html_markupat0x10ea4b910_252 test_functionremove_html_markupat0x10ea4b910_248->tag_functionremove_html_markupat0x10ea4b910_249 test_functionremove_html_markupat0x10ea4b910_248->test_functionremove_html_markupat0x10ea4b910_250 test_functionremove_html_markupat0x10ea4b910_246->tag_functionremove_html_markupat0x10ea4b910_247 test_functionremove_html_markupat0x10ea4b910_246->test_functionremove_html_markupat0x10ea4b910_248 c_functionremove_html_markupat0x10ea4b910_243 c for c in s: c_functionremove_html_markupat0x10ea4b910_243->test_functionremove_html_markupat0x10ea4b910_248 c_functionremove_html_markupat0x10ea4b910_243->test_functionremove_html_markupat0x10ea4b910_246 c_functionremove_html_markupat0x10ea4b910_243->test_functionremove_html_markupat0x10ea4b910_250 out_functionremove_html_markupat0x10ea4b910_253 out out = out + c c_functionremove_html_markupat0x10ea4b910_243->out_functionremove_html_markupat0x10ea4b910_253 s_functionremove_html_markupat0x10ea4b910_238 s def remove_html_markup(s):  # type: ignore s_functionremove_html_markupat0x10ea4b910_238->c_functionremove_html_markupat0x10ea4b910_243 test_functionremove_html_markupat0x10ea4b910_250->test_functionremove_html_markupat0x10ea4b910_252 out_functionremove_html_markupat0x10ea4b910_241->out_functionremove_html_markupat0x10ea4b910_253 test_functionremove_html_markupat0x10ea4b910_252->out_functionremove_html_markupat0x10ea4b910_253 remove_html_markupreturnvalue_functionremove_html_markupat0x10ea4b910_255 <remove_html_markup() return value> return out out_functionremove_html_markupat0x10ea4b910_253->remove_html_markupreturnvalue_functionremove_html_markupat0x10ea4b910_255 out_functionremove_html_markupat0x10ea4b910_253->out_functionremove_html_markupat0x10ea4b910_253
# print(repr(rhm_slicer.dependencies()))
rhm_slicer.code()
*  238 def remove_html_markup(s):  # type: ignore
*  239     tag = False
*  240     quote = False
*  241     out = ""
   242 
*  243     for c in s:  # <= s (238)
*  244         assert tag or not quote  # <= tag (249), quote (240), tag (239), tag (247)
   245 
*  246         if c == '<' and not quote:  # <= quote (240), c (243)
*  247             tag = True  # <- <test> (246)
*  248         elif c == '>' and not quote:  # <= quote (240), c (243); <- <test> (246)
*  249             tag = False  # <- <test> (248)
*  250         elif (c == '"' or c == "'") and tag:  # <= c (243); <- <test> (248)
   251             quote = not quote
*  252         elif not tag:  # <= tag (249), tag (247); <- <test> (250)
*  253             out = out + c  # <= c (243), out (253), out (241); <- <test> (252)
   254 
*  255     return out  # <= out (253)

We can also compute slices over the dependencies:

_, start_remove_html_markup = inspect.getsourcelines(remove_html_markup)
start_remove_html_markup
238
slicing_criterion = ('tag', (remove_html_markup,
                             start_remove_html_markup + 9))
tag_deps = rhm_slicer.dependencies().backward_slice(slicing_criterion)
tag_deps
dependencies quote_functionremove_html_markupat0x10ea4b910_240 quote quote = False test_functionremove_html_markupat0x10ea4b910_246 <test> if c == '<' and not quote: quote_functionremove_html_markupat0x10ea4b910_240->test_functionremove_html_markupat0x10ea4b910_246 c_functionremove_html_markupat0x10ea4b910_243 c for c in s: tag_functionremove_html_markupat0x10ea4b910_247 tag tag = True test_functionremove_html_markupat0x10ea4b910_246->tag_functionremove_html_markupat0x10ea4b910_247 c_functionremove_html_markupat0x10ea4b910_243->test_functionremove_html_markupat0x10ea4b910_246 s_functionremove_html_markupat0x10ea4b910_238 s def remove_html_markup(s):  # type: ignore s_functionremove_html_markupat0x10ea4b910_238->c_functionremove_html_markupat0x10ea4b910_243
# repr(tag_deps)

Calls and Augmented Assign

Our last example covers augmented assigns and data flow across function calls. We introduce two simple functions add_to() and mul_with():

def add_to(n, m):
    n += m
    return n
def mul_with(x, y):
    x *= y
    return x

And we put these two together in a single call:

def test_math() -> None:
    return mul_with(1, add_to(2, 3))
with Slicer(add_to, mul_with, test_math) as math_slicer:
    test_math()

The resulting dependence graph nicely captures the data flow between these calls, notably arguments and parameters:

math_slicer
dependencies x_functionmul_withat0x1087320e0_2 x x *= y mul_withreturnvalue_functionmul_withat0x1087320e0_3 <mul_with() return value> return x x_functionmul_withat0x1087320e0_2->mul_withreturnvalue_functionmul_withat0x1087320e0_3 y_functionmul_withat0x1087320e0_1 y def mul_with(x, y):  # type: ignore y_functionmul_withat0x1087320e0_1->x_functionmul_withat0x1087320e0_2 x_functionmul_withat0x1087320e0_1 x def mul_with(x, y):  # type: ignore x_functionmul_withat0x1087320e0_1->x_functionmul_withat0x1087320e0_2 add_toreturnvalue_functionadd_toat0x10ca56050_3 <add_to() return value> return n add_toreturnvalue_functionadd_toat0x10ca56050_3->y_functionmul_withat0x1087320e0_1 test_mathreturnvalue_functiontest_mathat0x1087323b0_2 <test_math() return value> return mul_with(1, add_to(2, 3)) mul_withreturnvalue_functionmul_withat0x1087320e0_3->test_mathreturnvalue_functiontest_mathat0x1087323b0_2 m_functionadd_toat0x10ca56050_1 m def add_to(n, m):  # type: ignore n_functionadd_toat0x10ca56050_2 n n += m m_functionadd_toat0x10ca56050_1->n_functionadd_toat0x10ca56050_2 n_functionadd_toat0x10ca56050_2->add_toreturnvalue_functionadd_toat0x10ca56050_3 n_functionadd_toat0x10ca56050_1 n def add_to(n, m):  # type: ignore n_functionadd_toat0x10ca56050_1->n_functionadd_toat0x10ca56050_2

These are also reflected in the code view:

math_slicer.code()
*    1 def mul_with(x, y):  # type: ignore  # <= <add_to() return value> (add_to:3)
*    2     x *= y  # <= y (1), x (1)
*    3     return x  # <= x (2)

*    1 def add_to(n, m):  # type: ignore
*    2     n += m  # <= m (1), n (1)
*    3     return n  # <= n (2)

     1 def test_math() -> None:
*    2     return mul_with(1, add_to(2, 3))  # <= <mul_with() return value> (mul_with:3)

Dynamic Instrumentation

When initializing Slicer(), one has to provide the set of functions to be instrumented. This is because the instrumentation has to take place before the code in the with block is executed. Can we determine this list on the fly – while Slicer() is executed?

The answer is: Yes, but the solution is a bit hackish – even more so than what we have seen above. In essence, we proceed in two steps:

  1. When DynamicSlicer.__init__() is called:
    • Use the inspect module to determine the source code of the call
    • Analyze the enclosed with block for function calls
    • Instrument these functions
  2. Whenever a function is about to be called (DataTracker.call())
    • Create an instrumented version of that function
    • Have the call() method return the instrumented function instead
Implementing Dynamic Instrumentation

We start with the aim of determining the with block to which our slicer is applied. The our_with_block() method inspects the source code of the Slicer() caller, and returns the one with block (as an AST) whose source line is currently active.

class WithVisitor(NodeVisitor):
    def __init__(self) -> None:
        self.withs: List[ast.With] = []

    def visit_With(self, node: ast.With) -> AST:
        self.withs.append(node)
        return self.generic_visit(node)
class Slicer(Slicer):
    def our_with_block(self) -> ast.With:
        """Return the currently active `with` block."""
        frame = self.caller_frame()
        source_lines, starting_lineno = inspect.getsourcelines(frame)
        starting_lineno = max(starting_lineno, 1)
        if len(source_lines) == 1:
            # We only get one `with` line, rather than the full block
            # This happens in Jupyter notebooks with iPython 8.1.0 and later.
            # Here's a hacky workaround to get the cell contents:
            # https://stackoverflow.com/questions/51566497/getting-the-source-of-an-object-defined-in-a-jupyter-notebook
            source_lines = inspect.linecache.getlines(inspect.getfile(frame))
            starting_lineno = 1

        source_ast = ast.parse(''.join(source_lines))
        wv = WithVisitor()
        wv.visit(source_ast)

        for with_ast in wv.withs:
            if starting_lineno