# Statistical Debugging¶

In this chapter, we introduce statistical debugging – the idea that specific events during execution could be statistically correlated with failures. We start with coverage of individual lines and then proceed towards further execution features.

from bookutils import YouTubeVideo


Prerequisites

import bookutils


## Synopsis¶

>>> from debuggingbook.StatisticalDebugger import <identifier>


and then make use of the following features.

This chapter introduces classes and techniques for statistical debugging – that is, correlating specific events, such as lines covered, with passing and failing outcomes.

To make use of the code in this chapter, use one of the provided StatisticalDebugger subclasses such as TarantulaDebugger or OchiaiDebugger.

Both are instantiated with a Collector denoting the type of events you want to correlate outcomes with. The default CoverageCollector, collecting line coverage.

### Collecting Events from Calls¶

To collect events from calls that are labeled manually, use

>>> debugger = TarantulaDebugger()
>>> with debugger.collect_pass():
>>>     remove_html_markup("abc")
>>> with debugger.collect_pass():
>>>     remove_html_markup('<b>abc</b>')
>>> with debugger.collect_fail():
>>>     remove_html_markup('"abc"')


Within each with block, the first function call is collected and tracked for coverage. (Note that only the first call is tracked.)

### Collecting Events from Tests¶

To collect events from tests that use exceptions to indicate failure, use the simpler with form:

>>> debugger = TarantulaDebugger()
>>> with debugger:
>>>     remove_html_markup("abc")
>>> with debugger:
>>>     remove_html_markup('<b>abc</b>')
>>> with debugger:
>>>     remove_html_markup('"abc"')
>>>     assert False  # raise an exception


with blocks that raise an exception will be classified as failing, blocks that do not will be classified as passing. Note that exceptions raised are "swallowed" by the debugger.

### Visualizing Events as a Table¶

After collecting events, you can print out the observed events – in this case, line numbers – in a table, showing in which runs they occurred (X), and with colors highlighting the suspiciousness of the event. A "red" event means that the event predominantly occurs in failing runs.

>>> debugger.event_table(args=True, color=True)

remove_html_markup s='abc' s='<b>abc</b>' s='"abc"'
remove_html_markup:1 X X X
remove_html_markup:2 X X X
remove_html_markup:3 X X X
remove_html_markup:4 X X X
remove_html_markup:6 X X X
remove_html_markup:7 X X X
remove_html_markup:8 - X -
remove_html_markup:9 X X X
remove_html_markup:10 - X -
remove_html_markup:11 X X X
remove_html_markup:12 - - X
remove_html_markup:13 X X X
remove_html_markup:14 X X X
remove_html_markup:16 X X X

### Visualizing Suspicious Code¶

If you collected coverage with CoverageCollector, you can also visualize the code with similar colors, highlighting suspicious lines:

>>> debugger

   1 def remove_html_markup(s):  # type: ignore
   2     tag = False
   3     quote = False
   4     out = ""
   5
   6     for c in s:
   7         if c == '<' and not quote:
   8             tag = True
   9         elif c == '>' and not quote:
  10             tag = False
  11         elif c == '"' or c == "'" and tag:
  12             quote = not quote
  13         elif not tag:
  14             out = out + c
  15
  16     return out

### Ranking Events¶

The method rank() returns a ranked list of events, starting with the most suspicious. This is useful for automated techniques that need potential defect locations.

>>> debugger.rank()
[('remove_html_markup', 12),
('remove_html_markup', 3),
('remove_html_markup', 1),
('remove_html_markup', 13),
('remove_html_markup', 6),
('remove_html_markup', 11),
('remove_html_markup', 16),
('remove_html_markup', 4),
('remove_html_markup', 9),
('remove_html_markup', 2),
('remove_html_markup', 14),
('remove_html_markup', 7),
('remove_html_markup', 10),
('remove_html_markup', 8)]


### Classes and Methods¶

Here are all classes defined in this chapter:

## Introduction¶

The idea behind statistical debugging is fairly simple. We have a program that sometimes passes and sometimes fails. This outcome can be correlated with events that precede it – properties of the input, properties of the execution, properties of the program state. If we, for instance, can find that "the program always fails when Line 123 is executed, and it always passes when Line 123 is not executed", then we have a strong correlation between Line 123 being executed and failure.

Such correlation does not necessarily mean causation. For this, we would have to prove that executing Line 123 always leads to failure, and that not executing it does not lead to (this) failure. Also, a correlation (or even a causation) does not mean that Line 123 contains the defect – for this, we would have to show that it actually is an error. Still, correlations make excellent hints as it comes to search for failure causes – in all generality, if you let your search be guided by events that correlate with failures, you are more likely to find important hints on how the failure comes to be.

## Collecting Events¶

How can we determine events that correlate with failure? We start with a general mechanism to actually collect events during execution. The abstract Collector class provides

• a collect() method made for collecting events, called from the traceit() tracer; and
• an events() method made for retrieving these events.

Both of these are abstract and will be defined further in subclasses.

from Tracer import Tracer

from types import FrameType, TracebackType

class Collector(Tracer):
"""A class to record events during execution."""

def collect(self, frame: FrameType, event: str, arg: Any) -> None:
"""Collecting function. To be overridden in subclasses."""
pass

def events(self) -> Set:
"""Return a collection of events. To be overridden in subclasses."""
return set()

def traceit(self, frame: FrameType, event: str, arg: Any) -> None:
self.collect(frame, event, arg)


A Collector class is used like Tracer, using a with statement. Let us apply it on the buggy variant of remove_html_markup() from the Introduction to Debugging:

def remove_html_markup(s):
tag = False
quote = False
out = ""

for c in s:
if c == '<' and not quote:
tag = True
elif c == '>' and not quote:
tag = False
elif c == '"' or c == "'" and tag:
quote = not quote
elif not tag:
out = out + c

return out

with Collector() as c:
out = remove_html_markup('"abc"')
out

'abc'


There's not much we can do with our collector, as the collect() and events() methods are yet empty. However, we can introduce an id() method which returns a string identifying the collector. This string is defined from the first function call encountered.

from types import FunctionType

Coverage = Set[Tuple[Callable, int]]

class Collector(Collector):
def __init__(self) -> None:
"""Constructor."""
self._function: Optional[Callable] = None
self._args: Optional[Dict[str, Any]] = None
self._argstring: Optional[str] = None
self._exception: Optional[Type] = None
self.items_to_ignore: List[Union[Type, Callable]] = [self.__class__]

def traceit(self, frame: FrameType, event: str, arg: Any) -> None:
"""
Tracing function.
Saves the first function and calls collect().
"""
for item in self.items_to_ignore:
if (isinstance(item, type) and 'self' in frame.f_locals and
isinstance(frame.f_locals['self'], item)):
# Ignore this class
return
if item.__name__ == frame.f_code.co_name:
# Ignore this function
return

if self._function is None and event == 'call':
# Save function
self._function = self.create_function(frame)
self._args = frame.f_locals.copy()
self._argstring = ", ".join([f"{var}={repr(self._args[var])}"
for var in self._args])

self.collect(frame, event, arg)

def collect(self, frame: FrameType, event: str, arg: Any) -> None:
"""Collector function. To be overloaded in subclasses."""
pass

def id(self) -> str:
"""Return an identifier for the collector,
created from the first call"""
return f"{self.function().__name__}({self.argstring()})"

def function(self) -> Callable:
"""Return the function from the first call, as a function object"""
if not self._function:
raise ValueError("No call collected")
return self._function

def argstring(self) -> str:
"""
Return the list of arguments from the first call,
as a printable string
"""
if not self._argstring:
raise ValueError("No call collected")
return self._argstring

def args(self) -> Dict[str, Any]:
"""Return a dict of argument names and values from the first call"""
if not self._args:
raise ValueError("No call collected")
return self._args

def exception(self) -> Optional[Type]:
"""Return the exception class from the first call,
or None if no exception was raised."""
return self._exception

def __repr__(self) -> str:
"""Return a string representation of the collector"""
# We use the ID as default representation when printed
return self.id()

def covered_functions(self) -> Set[Callable]:
"""Set of covered functions. To be overloaded in subclasses."""
return set()

def coverage(self) -> Coverage:
"""
Return a set (function, lineno) with locations covered.
"""
return set()


Here's how the collector works. We use a with clause to collect details on a function call:

with Collector() as c:
remove_html_markup('abc')


We can now retrieve details such as the function called...

c.function()

<function __main__.remove_html_markup(s)>


... or its arguments, as a name/value dictionary.

c.args()

{'s': 'abc'}


The id() method returns a printable representation of the call:

c.id()

"remove_html_markup(s='abc')"


The argstring() method does the same for the argument string only.

c.argstring()

"s='abc'"


With this, we can collect the basic information to identify calls – such that we can later correlate their events with success or failure.

### Error Prevention¶

While collecting, we'd like to avoid collecting events in the collection infrastructure. The items_to_ignore attribute takes care of this.

class Collector(Collector):
items_to_ignore: List[Union[Type, Callable]]) \
-> None:
"""
Define additional classes and functions to ignore during collection
(typically Debugger classes using these collectors).
"""
self.items_to_ignore += items_to_ignore


If we exit a block without having collected anything, that's likely an error.

class Collector(Collector):
def __exit__(self, exc_tp: Type, exc_value: BaseException,
exc_traceback: TracebackType) -> Optional[bool]:
"""Exit the with block."""
ret = super().__exit__(exc_tp, exc_value, exc_traceback)

if not self._function:
if exc_tp:
return False  # re-raise exception
else:
raise ValueError("No call collected")

return ret


## Collecting Coverage¶

So far, our Collector class does not collect any events. Let us extend it such that it collects coverage information – that is, the set of locations executed. To this end, we introduce a CoverageCollector subclass which saves the coverage in a set containing functions and line numbers.

from types import FrameType

from StackInspector import StackInspector

class CoverageCollector(Collector, StackInspector):
"""A class to record covered locations during execution."""

def __init__(self) -> None:
"""Constructor."""
super().__init__()
self._coverage: Coverage = set()

def collect(self, frame: FrameType, event: str, arg: Any) -> None:
"""
Save coverage for an observed event.
"""
name = frame.f_code.co_name
function = self.search_func(name, frame)

if function is None:
function = self.create_function(frame)

location = (function, frame.f_lineno)


We also override events() such that it returns the set of covered locations.

class CoverageCollector(CoverageCollector):
def events(self) -> Set[Tuple[str, int]]:
"""
Return the set of locations covered.
Each location comes as a pair (function_name, lineno).
"""
return {(func.__name__, lineno) for func, lineno in self._coverage}


The methods coverage() and covered_functions() allow precise access to the coverage obtained.

class CoverageCollector(CoverageCollector):
def covered_functions(self) -> Set[Callable]:
"""Return a set with all functions covered."""
return {func for func, lineno in self._coverage}

def coverage(self) -> Coverage:
"""Return a set (function, lineno) with all locations covered."""
return self._coverage


Here is how we can use CoverageCollector to determine the lines executed during a run of remove_html_markup():

with CoverageCollector() as c:
remove_html_markup('abc')
c.events()

{('remove_html_markup', 1),
('remove_html_markup', 2),
('remove_html_markup', 3),
('remove_html_markup', 4),
('remove_html_markup', 6),
('remove_html_markup', 7),
('remove_html_markup', 9),
('remove_html_markup', 11),
('remove_html_markup', 13),
('remove_html_markup', 14),
('remove_html_markup', 16)}


Sets of line numbers alone are not too revealing. They provide more insights if we actually list the code, highlighting these numbers:

import inspect

from bookutils import getsourcelines    # like inspect.getsourcelines(), but in color

def code_with_coverage(function: Callable, coverage: Coverage) -> None:
source_lines, starting_line_number = \
getsourcelines(function)

line_number = starting_line_number
for line in source_lines:
marker = '*' if (function, line_number) in coverage else ' '
print(f"{line_number:4} {marker} {line}", end='')
line_number += 1

code_with_coverage(remove_html_markup, c.coverage())

   1 * def remove_html_markup(s):  # type: ignore
2 *     tag = False
3 *     quote = False
4 *     out = ""
5
6 *     for c in s:
7 *         if c == '<' and not quote:
8               tag = True
9 *         elif c == '>' and not quote:
10               tag = False
11 *         elif c == '"' or c == "'" and tag:
12               quote = not quote
13 *         elif not tag:
14 *             out = out + c
15
16 *     return out


Remember that the input s was "abc"? In this listing, we can see which lines were covered and which lines were not. From the listing already, we can see that s has neither tags nor quotes.

Such coverage computation plays a big role in testing, as one wants tests to cover as many different aspects of program execution (and notably code) as possible. But also during debugging, code coverage is essential: If some code was not even executed in the failing run, then any change to it will have no effect.

from bookutils import quiz


### Quiz

Let the input be "<b>Don't do this!</b>". Which of these lines are executed? Use the code to find out!

To find the solution, try this out yourself:

with CoverageCollector() as c:
remove_html_markup("<b>Don't do this!</b>")
# code_with_coverage(remove_html_markup, c.coverage)


## Computing Differences¶

Let us get back to the idea that we want to correlate events with passing and failing outcomes. For this, we need to examine events in both passing and failing runs, and determine their differences – since it is these differences we want to associate with their respective outcome.

### A Base Class for Statistical Debugging¶

The StatisticalDebugger base class takes a collector class (such as CoverageCollector). Its collect() method creates a new collector of that very class, which will be maintained by the debugger. As argument, collect() takes a string characterizing the outcome (such as 'PASS' or 'FAIL'). This is how one would use it:

debugger = StatisticalDebugger()
with debugger.collect('PASS'):
some_passing_run()
with debugger.collect('PASS'):
another_passing_run()
with debugger.collect('FAIL'):
some_failing_run()


Let us implement StatisticalDebugger. The base class gets a collector class as argument:

class StatisticalDebugger:
"""A class to collect events for multiple outcomes."""

def __init__(self, collector_class: Type = CoverageCollector, log: bool = False):
"""Constructor. Use instances of collector_class to collect events."""
self.collector_class = collector_class
self.collectors: Dict[str, List[Collector]] = {}
self.log = log


The collect() method creates (and stores) a collector for the given outcome, using the given outcome to characterize the run. Any additional arguments are passed to the collector.

class StatisticalDebugger(StatisticalDebugger):
def collect(self, outcome: str, *args: Any, **kwargs: Any) -> Collector:
"""Return a collector for the given outcome.
Additional args are passed to the collector."""
collector = self.collector_class(*args, **kwargs)

def add_collector(self, outcome: str, collector: Collector) -> Collector:
if outcome not in self.collectors:
self.collectors[outcome] = []
self.collectors[outcome].append(collector)
return collector


The all_events() method produces a union of all events observed. If an outcome is given, it produces a union of all events with that outcome:

class StatisticalDebugger(StatisticalDebugger):
def all_events(self, outcome: Optional[str] = None) -> Set[Any]:
"""Return a set of all events observed."""
all_events = set()

if outcome:
if outcome in self.collectors:
for collector in self.collectors[outcome]:
all_events.update(collector.events())
else:
for outcome in self.collectors:
for collector in self.collectors[outcome]:
all_events.update(collector.events())

return all_events


Here's a simple example of StatisticalDebugger in action:

s = StatisticalDebugger()
with s.collect('PASS'):
remove_html_markup("abc")
with s.collect('PASS'):
remove_html_markup('<b>abc</b>')
with s.collect('FAIL'):
remove_html_markup('"abc"')


The method all_events() returns all events collected:

s.all_events()

{('remove_html_markup', 1),
('remove_html_markup', 2),
('remove_html_markup', 3),
('remove_html_markup', 4),
('remove_html_markup', 6),
('remove_html_markup', 7),
('remove_html_markup', 8),
('remove_html_markup', 9),
('remove_html_markup', 10),
('remove_html_markup', 11),
('remove_html_markup', 12),
('remove_html_markup', 13),
('remove_html_markup', 14),
('remove_html_markup', 16)}


If given an outcome as argument, we obtain all events with the given outcome.

s.all_events('FAIL')

{('remove_html_markup', 1),
('remove_html_markup', 2),
('remove_html_markup', 3),
('remove_html_markup', 4),
('remove_html_markup', 6),
('remove_html_markup', 7),
('remove_html_markup', 9),
('remove_html_markup', 11),
('remove_html_markup', 12),
('remove_html_markup', 13),
('remove_html_markup', 14),
('remove_html_markup', 16)}


The attribute collectors maps outcomes to lists of collectors:

s.collectors

{'PASS': [remove_html_markup(s='abc'), remove_html_markup(s='<b>abc</b>')],
'FAIL': [remove_html_markup(s='"abc"')]}


Here's the collector of the one (and first) passing run:

s.collectors['PASS'][0].id()

"remove_html_markup(s='abc')"

s.collectors['PASS'][0].events()

{('remove_html_markup', 1),
('remove_html_markup', 2),
('remove_html_markup', 3),
('remove_html_markup', 4),
('remove_html_markup', 6),
('remove_html_markup', 7),
('remove_html_markup', 9),
('remove_html_markup', 11),
('remove_html_markup', 13),
('remove_html_markup', 14),
('remove_html_markup', 16)}


To better highlight the differences between the collected events, we introduce a method event_table() that prints out whether an event took place in a run.

Printing an Event Table
from IPython.display import Markdown

import html

class StatisticalDebugger(StatisticalDebugger):
def function(self) -> Optional[Callable]:
"""
Return the entry function from the events observed,
or None if ambiguous.
"""
names_seen = set()
functions = []
for outcome in self.collectors:
for collector in self.collectors[outcome]:
# We may have multiple copies of the function,
# but sharing the same name
func = collector.function()
if func.__name__ not in names_seen:
functions.append(func)

if len(functions) != 1:
return None  # ambiguous
return functions[0]

def covered_functions(self) -> Set[Callable]:
"""Return a set of all functions observed."""
functions = set()
for outcome in self.collectors:
for collector in self.collectors[outcome]:
functions |= collector.covered_functions()
return functions

def coverage(self) -> Coverage:
"""Return a set of all (functions, line_numbers) observed"""
coverage = set()
for outcome in self.collectors:
for collector in self.collectors[outcome]:
coverage |= collector.coverage()
return coverage

def color(self, event: Any) -> Optional[str]:
"""
Return a color for the given event, or None.
"""
return None

def tooltip(self, event: Any) -> Optional[str]:
"""
Return a tooltip string for the given event, or None.
"""
return None

def event_str(self, event: Any) -> str:
"""Format the given event. To be overloaded in subclasses."""
if isinstance(event, str):
return event
if isinstance(event, tuple):
return ":".join(self.event_str(elem) for elem in event)
return str(event)

def event_table_text(self, *, args: bool = False, color: bool = False) -> str:
"""
Print out a table of events observed.
If args is True, use arguments as headers.
If color is True, use colors.
"""
sep = ' | '
all_events = self.all_events()
longest_event = max(len(f"{self.event_str(event)}")
for event in all_events)
out = ""

if args:
out += '| '
func = self.function()
if func:
out += '' + func.__name__ + ''
out += sep
for name in self.collectors:
for collector in self.collectors[name]:
out += '' + collector.argstring() + '' + sep
out += '\n'
else:
out += '| ' + ' ' * longest_event + sep
for name in self.collectors:
for i in range(len(self.collectors[name])):
out += name + sep
out += '\n'

out += '| ' + '-' * longest_event + sep
for name in self.collectors:
for i in range(len(self.collectors[name])):
out += '-' * len(name) + sep
out += '\n'

# Data
for event in sorted(all_events):
event_name = self.event_str(event).rjust(longest_event)

tooltip = self.tooltip(event)
if tooltip:
title = f' title="{tooltip}"'
else:
title = ''

if color:
color_name = self.color(event)
if color_name:
event_name = \
f'<samp style="background-color: {color_name}"{title}>' \
f'{html.escape(event_name)}' \
f'</samp>'

out += f"| {event_name}" + sep
for name in self.collectors:
for collector in self.collectors[name]:
out += ' ' * (len(name) - 1)
if event in collector.events():
out += "X"
else:
out += "-"
out += sep
out += '\n'

return out

def event_table(self, **_args: Any) -> Any:
"""Print out event table in Markdown format."""
return Markdown(self.event_table_text(**_args))

def __repr__(self) -> str:
return self.event_table_text()

def _repr_markdown_(self) -> str:
return self.event_table_text(args=True, color=True)

s = StatisticalDebugger()
with s.collect('PASS'):
remove_html_markup("abc")
with s.collect('PASS'):
remove_html_markup('<b>abc</b>')
with s.collect('FAIL'):
remove_html_markup('"abc"')

s.event_table(args=True)

remove_html_markup s='abc' s='<b>abc</b>' s='"abc"'
remove_html_markup:1 X X X
remove_html_markup:2 X X X
remove_html_markup:3 X X X
remove_html_markup:4 X X X
remove_html_markup:6 X X X
remove_html_markup:7 X X X
remove_html_markup:8 - X -
remove_html_markup:9 X X X
remove_html_markup:10 - X -
remove_html_markup:11 X X X
remove_html_markup:12 - - X
remove_html_markup:13 X X X
remove_html_markup:14 X X X
remove_html_markup:16 X X X

### Quiz

How many lines are executed in the failing run only?

Indeed, Line 12 executed in the failing run only would be a correlation to look for.

### Collecting Passing and Failing Runs¶

While our StatisticalDebugger class allows arbitrary outcomes, we are typically only interested in two outcomes, namely passing vs. failing runs. We therefore introduce a specialized DifferenceDebugger class that provides customized methods to collect and access passing and failing runs.

class DifferenceDebugger(StatisticalDebugger):
"""A class to collect events for passing and failing outcomes."""

PASS = 'PASS'
FAIL = 'FAIL'

def collect_pass(self, *args: Any, **kwargs: Any) -> Collector:
"""Return a collector for passing runs."""
return self.collect(self.PASS, *args, **kwargs)

def collect_fail(self, *args: Any, **kwargs: Any) -> Collector:
"""Return a collector for failing runs."""
return self.collect(self.FAIL, *args, **kwargs)

def pass_collectors(self) -> List[Collector]:
return self.collectors[self.PASS]

def fail_collectors(self) -> List[Collector]:
return self.collectors[self.FAIL]

def all_fail_events(self) -> Set[Any]:
"""Return all events observed in failing runs."""
return self.all_events(self.FAIL)

def all_pass_events(self) -> Set[Any]:
"""Return all events observed in passing runs."""
return self.all_events(self.PASS)

def only_fail_events(self) -> Set[Any]:
"""Return all events observed only in failing runs."""
return self.all_fail_events() - self.all_pass_events()

def only_pass_events(self) -> Set[Any]:
"""Return all events observed only in passing runs."""
return self.all_pass_events() - self.all_fail_events()


We can use DifferenceDebugger just as a StatisticalDebugger:

def test_debugger_html_simple(debugger: T1) -> T1:
with debugger.collect_pass():
remove_html_markup('abc')
with debugger.collect_pass():
remove_html_markup('<b>abc</b>')
with debugger.collect_fail():
remove_html_markup('"abc"')
return debugger


However, since the outcome of tests may not always be predetermined, we provide a simpler interface for tests that can fail (= raise an exception) or pass (not raise an exception).

class DifferenceDebugger(DifferenceDebugger):
def __enter__(self) -> Any:
"""Enter a with block. Collect coverage and outcome;
classify as FAIL if the block raises an exception,
and PASS if it does not.
"""
self.collector = self.collector_class()
self.collector.__enter__()
return self

def __exit__(self, exc_tp: Type, exc_value: BaseException,
exc_traceback: TracebackType) -> Optional[bool]:
"""Exit the with block."""
status = self.collector.__exit__(exc_tp, exc_value, exc_traceback)

if status is None:
pass
else:
return False  # Internal error; re-raise exception

if exc_tp is None:
outcome = self.PASS
else:
outcome = self.FAIL

return True  # Ignore exception, if any


Using this interface, we can rewrite test_debugger_html():

def test_debugger_html(debugger: T2) -> T2:
with debugger:
remove_html_markup('abc')
with debugger:
remove_html_markup('<b>abc</b>')
with debugger:
remove_html_markup('"abc"')
assert False  # Mark test as failing

return debugger

test_debugger_html(DifferenceDebugger())

remove_html_markup s='abc' s='<b>abc</b>' s='"abc"'
remove_html_markup:1 X X X
remove_html_markup:2 X X X
remove_html_markup:3 X X X
remove_html_markup:4 X X X
remove_html_markup:6 X X X
remove_html_markup:7 X X X
remove_html_markup:8 - X -
remove_html_markup:9 X X X
remove_html_markup:10 - X -
remove_html_markup:11 X X X
remove_html_markup:12 - - X
remove_html_markup:13 X X X
remove_html_markup:14 X X X
remove_html_markup:16 X X X

### Analyzing Events¶

Let us now focus on analyzing events collected. Since events come back as sets, we can compute unions and differences between these sets. For instance, we can compute which lines were executed in any of the passing runs of test_debugger_html(), above:

debugger = test_debugger_html(DifferenceDebugger())

pass_1_events = debugger.pass_collectors()[0].events()

pass_2_events = debugger.pass_collectors()[1].events()

in_any_pass = pass_1_events | pass_2_events
in_any_pass

{('remove_html_markup', 1),
('remove_html_markup', 2),
('remove_html_markup', 3),
('remove_html_markup', 4),
('remove_html_markup', 6),
('remove_html_markup', 7),
('remove_html_markup', 8),
('remove_html_markup', 9),
('remove_html_markup', 10),
('remove_html_markup', 11),
('remove_html_markup', 13),
('remove_html_markup', 14),
('remove_html_markup', 16)}


Likewise, we can determine which lines were only executed in the failing run:

fail_events = debugger.fail_collectors()[0].events()

only_in_fail = fail_events - in_any_pass
only_in_fail

{('remove_html_markup', 12)}


And we see that the "failing" run is characterized by processing quotes:

code_with_coverage(remove_html_markup, only_in_fail)

   1   def remove_html_markup(s):  # type: ignore
2       tag = False
3       quote = False
4       out = ""
5
6       for c in s:
7           if c == '<' and not quote:
8               tag = True
9           elif c == '>' and not quote:
10               tag = False
11           elif c == '"' or c == "'" and tag:
12               quote = not quote
13           elif not tag:
14               out = out + c
15
16       return out

debugger = test_debugger_html(DifferenceDebugger())

debugger.all_events()

{('remove_html_markup', 1),
('remove_html_markup', 2),
('remove_html_markup', 3),
('remove_html_markup', 4),
('remove_html_markup', 6),
('remove_html_markup', 7),
('remove_html_markup', 8),
('remove_html_markup', 9),
('remove_html_markup', 10),
('remove_html_markup', 11),
('remove_html_markup', 12),
('remove_html_markup', 13),
('remove_html_markup', 14),
('remove_html_markup', 16)}


These are the lines executed only in the failing run:

debugger.only_fail_events()

{('remove_html_markup', 12)}


These are the lines executed only in the passing runs:

debugger.only_pass_events()

{('remove_html_markup', 8), ('remove_html_markup', 10)}


Again, having these lines individually is neat, but things become much more interesting if we can see the associated code lines just as well. That's what we will do in the next section.

## Visualizing Differences¶

To show correlations of line coverage in context, we introduce a number of visualization techniques that highlight code with different colors.

### Discrete Spectrum¶

The first idea is to use a discrete spectrum of three colors:

• red for code executed in failing runs only
• green for code executed in passing runs only
• yellow for code executed in both passing and failing runs.

Code that is not executed stays unhighlighted.

We first introduce an abstract class SpectrumDebugger that provides the essential functions. suspiciousness() returns a value between 0 and 1 indicating the suspiciousness of the given event - or None if unknown.

class SpectrumDebugger(DifferenceDebugger):
def suspiciousness(self, event: Any) -> Optional[float]:
"""
Return a suspiciousness value in the range [0, 1.0]
for the given event, or None if unknown.
"""
return None


The tooltip() and percentage() methods convert the suspiciousness into a human-readable form.

class SpectrumDebugger(SpectrumDebugger):
def tooltip(self, event: Any) -> str:
"""
Return a tooltip for the given event (default: percentage).
"""
return self.percentage(event)

def percentage(self, event: Any) -> str:
"""
Return the suspiciousness for the given event as percentage string.
"""
suspiciousness = self.suspiciousness(event)
if suspiciousness is not None:
return str(int(suspiciousness * 100)).rjust(3) + '%'
else:
return ' ' * len('100%')


The code() method takes a function and shows each of its source code lines using the given spectrum, using HTML markup:

class SpectrumDebugger(SpectrumDebugger):
def code(self, functions: Optional[Set[Callable]] = None, *,
color: bool = False, suspiciousness: bool = False,
line_numbers: bool = True) -> str:
"""
Return a listing of functions (default: covered functions).
If color is True, render as HTML, using suspiciousness colors.
If suspiciousness is True, include suspiciousness values.
If line_numbers is True (default), include line numbers.
"""

if not functions:
functions = self.covered_functions()

out = ""
seen = set()
for function in functions:
source_lines, starting_line_number = \
inspect.getsourcelines(function)

if (function.__name__, starting_line_number) in seen:
continue

if out:
out += '\n'
if color:
out += '<p/>'

line_number = starting_line_number
for line in source_lines:
if color:
line = html.escape(line)
if line.strip() == '':
line = '&nbsp;'

location = (function.__name__, line_number)
location_suspiciousness = self.suspiciousness(location)
if location_suspiciousness is not None:
tooltip = f"Line {line_number}: {self.tooltip(location)}"
else:
tooltip = f"Line {line_number}: not executed"

if suspiciousness:
line = self.percentage(location) + ' ' + line

if line_numbers:
line = str(line_number).rjust(4) + ' ' + line

line_color = self.color(location)

if color and line_color:
line = f'''<pre style="background-color:{line_color}"
title="{tooltip}">{line.rstrip()}</pre>'''
elif color:
line = f'<pre title="{tooltip}">{line}</pre>'
else:
line = line.rstrip()

out += line + '\n'
line_number += 1

return out


We introduce a few helper methods to visualize the code with colors in various forms.

class SpectrumDebugger(SpectrumDebugger):
def _repr_html_(self) -> str:
"""When output in Jupyter, visualize as HTML"""
return self.code(color=True)

def __str__(self) -> str:
"""Show code as string"""
return self.code(color=False, suspiciousness=True)

def __repr__(self) -> str:
"""Show code as string"""
return self.code(color=False, suspiciousness=True)


So far, however, central methods like suspiciousness() or color() were abstract – that is, to be defined in subclasses. Our DiscreteSpectrumDebugger subclass provides concrete implementations for these, with color() returning one of the three colors depending on the line number:

class DiscreteSpectrumDebugger(SpectrumDebugger):
"""Visualize differences between executions using three discrete colors"""

def suspiciousness(self, event: Any) -> Optional[float]:
"""
Return a suspiciousness value [0, 1.0]
for the given event, or None if unknown.
"""
passing = self.all_pass_events()
failing = self.all_fail_events()

if event in passing and event in failing:
return 0.5
elif event in failing:
return 1.0
elif event in passing:
return 0.0
else:
return None

def color(self, event: Any) -> Optional[str]:
"""
Return a HTML color for the given event.
"""
suspiciousness = self.suspiciousness(event)
if suspiciousness is None:
return None

if suspiciousness > 0.8:
return 'mistyrose'
if suspiciousness >= 0.5:
return 'lightyellow'

return 'honeydew'

def tooltip(self, event: Any) -> str:
"""Return a tooltip for the given event."""
passing = self.all_pass_events()
failing = self.all_fail_events()

if event in passing and event in failing:
return "in passing and failing runs"
elif event in failing:
return "only in failing runs"
elif event in passing:
return "only in passing runs"
else:
return "never"


This is how the only_pass_events() and only_fail_events() sets look like when visualized with code. The "culprit" line is well highlighted:

debugger = test_debugger_html(DiscreteSpectrumDebugger())

debugger

   1 def remove_html_markup(s):  # type: ignore
   2     tag = False
   3     quote = False
   4     out = ""
   5
   6     for c in s:
   7         if c == '<' and not quote:
   8             tag = True
   9         elif c == '>' and not quote:
  10             tag = False
  11         elif c == '"' or c == "'" and tag:
  12             quote = not quote
  13         elif not tag:
  14             out = out + c
  15
  16     return out

We can clearly see that the failure is correlated with the presence of quotes in the input string (which is an important hint!). But does this also show us immediately where the defect to be fixed is?

### Quiz

Does the line quote = not quote actually contain the defect?

Indeed, it is the governing condition that is wrong – that is, the condition that caused Line 12 to be executed in the first place. In order to fix a program, we have to find a location that

1. causes the failure (i.e., it can be changed to make the failure go away); and
2. is a defect (i.e., contains an error).

In our example above, the highlighted code line is a symptom for the error. To some extent, it is also a cause, since, say, commenting it out would also resolve the given failure, at the cost of causing other failures. However, the preceding condition also is a cause, as is the presence of quotes in the input.

Only one of these also is a defect, though, and that is the preceding condition. Hence, while correlations can provide important hints, they do not necessarily locate defects.

For those of us who may not have color HTML output ready, simply printing the debugger lists suspiciousness values as percentages.

print(debugger)

   1  50% def remove_html_markup(s):  # type: ignore
2  50%     tag = False
3  50%     quote = False
4  50%     out = ""
5
6  50%     for c in s:
7  50%         if c == '<' and not quote:
8   0%             tag = True
9  50%         elif c == '>' and not quote:
10   0%             tag = False
11  50%         elif c == '"' or c == "'" and tag:
12 100%             quote = not quote
13  50%         elif not tag:
14  50%             out = out + c
15
16  50%     return out


### Continuous Spectrum¶

The criterion that an event should only occur in failing runs (and not in passing runs) can be too aggressive. In particular, if we have another run that executes the "culprit" lines, but does not fail, our "only in fail" criterion will no longer be helpful.

Here is an example. The input

<b color="blue">text</b>


will trigger the "culprit" line

quote = not quote


but actually produce an output where the tags are properly stripped:

remove_html_markup('<b color="blue">text</b>')

'text'


As a consequence, we no longer have lines that are being executed only in failing runs:

debugger = test_debugger_html(DiscreteSpectrumDebugger())
with debugger.collect_pass():

debugger.only_fail_events()

set()


In our spectrum output, the effect now is that the "culprit" line is as yellow as all others.

debugger

   1 def remove_html_markup(s):  # type: ignore
   2     tag = False
   3     quote = False
   4     out = ""
   5
   6     for c in s:
   7         if c == '<' and not quote:
   8             tag = True
   9         elif c == '>' and not quote:
  10             tag = False
  11         elif c == '"' or c == "'" and tag:
  12             quote = not quote
  13         elif not tag:
  14             out = out + c
  15
  16     return out

We therefore introduce a different method for highlighting lines, based on their relative occurrence with respect to all runs: If a line has been mostly executed in failing runs, its color should shift towards red; if a line has been mostly executed in passing runs, its color should shift towards green.

This continuous spectrum has been introduced by the seminal Tarantula tool [Jones et al, 2002]. In Tarantula, the color hue for each line is defined as follows:

$$\textit{color hue}(\textit{line}) = \textit{low color(red)} + \frac{\%\textit{passed}(\textit{line})}{\%\textit{passed}(\textit{line}) + \%\textit{failed}(\textit{line})} \times \textit{color range}$$

Here, %passed and %failed denote the percentage at which a line has been executed in passing and failing runs, respectively. A hue of 0.0 stands for red, a hue of 1.0 stands for green, and a hue of 0.5 stands for equal fractions of red and green, yielding yellow.

We can implement these measures right away as methods in a new ContinuousSpectrumDebugger class:

class ContinuousSpectrumDebugger(DiscreteSpectrumDebugger):
"""Visualize differences between executions using a color spectrum"""

def collectors_with_event(self, event: Any, category: str) -> Set[Collector]:
"""
Return all collectors in a category
that observed the given event.
"""
all_runs = self.collectors[category]
collectors_with_event = set(collector for collector in all_runs
if event in collector.events())
return collectors_with_event

def collectors_without_event(self, event: Any, category: str) -> Set[Collector]:
"""
Return all collectors in a category
that did not observe the given event.
"""
all_runs = self.collectors[category]
collectors_without_event = set(collector for collector in all_runs
if event not in collector.events())
return collectors_without_event

def event_fraction(self, event: Any, category: str) -> float:
if category not in self.collectors:
return 0.0

all_collectors = self.collectors[category]
collectors_with_event = self.collectors_with_event(event, category)
fraction = len(collectors_with_event) / len(all_collectors)
# print(f"%{category}({event}) = {fraction}")
return fraction

def passed_fraction(self, event: Any) -> float:
return self.event_fraction(event, self.PASS)

def failed_fraction(self, event: Any) -> float:
return self.event_fraction(event, self.FAIL)

def hue(self, event: Any) -> Optional[float]:
"""Return a color hue from 0.0 (red) to 1.0 (green)."""
passed = self.passed_fraction(event)
failed = self.failed_fraction(event)
if passed + failed > 0:
return passed / (passed + failed)
else:
return None


Having a continuous hue also implies a continuous suspiciousness and associated tooltips:

class ContinuousSpectrumDebugger(ContinuousSpectrumDebugger):
def suspiciousness(self, event: Any) -> Optional[float]:
hue = self.hue(event)
if hue is None:
return None
return 1 - hue

def tooltip(self, event: Any) -> str:
return self.percentage(event)


The hue for lines executed only in failing runs is (deep) red, as expected:

debugger = test_debugger_html(ContinuousSpectrumDebugger())

for location in debugger.only_fail_events():
print(location, debugger.hue(location))

('remove_html_markup', 12) 0.0


Likewise, the hue for lines executed in passing runs is (deep) green:

for location in debugger.only_pass_events():
print(location, debugger.hue(location))

('remove_html_markup', 8) 1.0
('remove_html_markup', 10) 1.0


The Tarantula tool not only sets the hue for a line, but also uses brightness as measure for support – that is, how often was the line executed at all. The brighter a line, the stronger the correlation with a passing or failing outcome.

The brightness is defined as follows:

$$\textit{brightness}(line) = \max(\%\textit{passed}(\textit{line}), \%\textit{failed}(\textit{line}))$$

and it is easily implemented, too:

class ContinuousSpectrumDebugger(ContinuousSpectrumDebugger):
def brightness(self, event: Any) -> float:
return max(self.passed_fraction(event), self.failed_fraction(event))


Our single "only in fail" line has a brightness of 1.0 (the maximum).

debugger = test_debugger_html(ContinuousSpectrumDebugger())
for location in debugger.only_fail_events():
print(location, debugger.brightness(location))

('remove_html_markup', 12) 1.0


With this, we can now define a color for each line. To this end, we override the (previously discrete) color() method such that it returns a color specification giving hue and brightness. We use the HTML format hsl(hue, saturation, lightness) where the hue is given as a value between 0 and 360 (0 is red, 120 is green) and saturation and lightness are provided as percentages.

class ContinuousSpectrumDebugger(ContinuousSpectrumDebugger):
def color(self, event: Any) -> Optional[str]:
hue = self.hue(event)
if hue is None:
return None
saturation = self.brightness(event)

# HSL color values are specified with:
# hsl(hue, saturation, lightness).
return f"hsl({hue * 120}, {saturation * 100}%, 80%)"

debugger = test_debugger_html(ContinuousSpectrumDebugger())


Lines executed only in failing runs are still shown in red:

for location in debugger.only_fail_events():
print(location, debugger.color(location))

('remove_html_markup', 12) hsl(0.0, 100.0%, 80%)


... whereas lines executed only in passing runs are still shown in green:

for location in debugger.only_pass_events():
print(location, debugger.color(location))

('remove_html_markup', 8) hsl(120.0, 50.0%, 80%)
('remove_html_markup', 10) hsl(120.0, 50.0%, 80%)

debugger

   1 def remove_html_markup(s):  # type: ignore
   2     tag = False
   3     quote = False
   4     out = ""
   5
   6     for c in s:
   7         if c == '<' and not quote:
   8             tag = True
   9         elif c == '>' and not quote:
  10             tag = False
  11         elif c == '"' or c == "'" and tag:
  12             quote = not quote
  13         elif not tag:
  14             out = out + c
  15
  16     return out

What happens with our quote = not quote "culprit" line if it is executed in passing runs, too?

with debugger.collect_pass():


### Quiz

In which color will the quote = not quote "culprit" line be shown after executing the above code?

We see that it still is shown with an orange-red tint.

debugger

   1 def remove_html_markup(s):  # type: ignore
   2     tag = False
   3     quote = False
   4     out = ""
   5
   6     for c in s:
   7         if c == '<' and not quote:
   8             tag = True
   9         elif c == '>' and not quote:
  10             tag = False
  11         elif c == '"' or c == "'" and tag:
  12             quote = not quote
  13         elif not tag:
  14             out = out + c
  15
  16     return out

Here's another example, coming right from the Tarantula paper. The middle() function takes three numbers x, y, and z, and returns the one that is neither the minimum nor the maximum of the three:

def middle(x, y, z):
if y < z:
if x < y:
return y
elif x < z:
return y
else:
if x > y:
return y
elif x > z:
return x
return z

middle(1, 2, 3)

2


Unfortunately, middle() can fail:

middle(2, 1, 3)

1


Let is see whether we can find the bug with a few additional test cases:

def test_debugger_middle(debugger: T3) -> T3:
with debugger.collect_pass():
middle(3, 3, 5)
with debugger.collect_pass():
middle(1, 2, 3)
with debugger.collect_pass():
middle(3, 2, 1)
with debugger.collect_pass():
middle(5, 5, 5)
with debugger.collect_pass():
middle(5, 3, 4)
with debugger.collect_fail():
middle(2, 1, 3)
return debugger


Note that in order to collect data from multiple function invocations, you need to have a separate with clause for every invocation. The following will not work correctly:

with debugger.collect_pass():
middle(3, 3, 5)
middle(1, 2, 3)
...

debugger = test_debugger_middle(ContinuousSpectrumDebugger())

debugger.event_table(args=True)

middle z=5, y=3, x=3 z=3, y=2, x=1 z=1, y=2, x=3 z=5, y=5, x=5 z=4, y=3, x=5 z=3, y=1, x=2
middle:1 X X X X X X
middle:2 X X X X X X
middle:3 X X - - X X
middle:4 - X - - - -
middle:5 X - - - X X
middle:6 X - - - - X
middle:8 - - X X - -
middle:9 - - X - - -
middle:10 - - - X - -
middle:12 - - - X X -

Here comes the visualization. We see that the return y line is the culprit here – and actually also the one to be fixed.

debugger

   1 def middle(x, y, z):  # type: ignore
   2     if y < z:
   3         if x < y:
   4             return y
   5         elif x < z:
   6             return y
   7     else:

   8         if x > y:
   9             return y
  10         elif x > z:
  11             return x

  12     return z

### Quiz

Which of the above lines should be fixed?

Indeed, in the middle() example, the "reddest" line is also the one to be fixed. Here is the fixed version:

def middle_fixed(x, y, z):
if y < z:
if x < y:
return y
elif x < z:
return x
else:
if x > y:
return y
elif x > z:
return x
return z

middle_fixed(2, 1, 3)

2


## Ranking Lines by Suspiciousness¶

In a large program, there can be several locations (and events) that could be flagged as suspicious. It suffices that some large code block of say, 1,000 lines, is mostly executed in failing runs, and then all of this code block will be visualized in some shade of red.

To further highlight the "most suspicious" events, one idea is to use a ranking – that is, coming up with a list of events where those events most correlated with failures would be shown at the top. The programmer would then examine these events one by one and proceed down the list. We will show how this works for two "correlation" metrics – first the Tarantula metric, as introduced above, and then the Ochiai metric, which has shown to be one of the best "ranking" metrics.

We introduce a base class RankingDebugger with an abstract method suspiciousness() to be overloaded in subclasses. The method rank() returns a list of all events observed, sorted by suspiciousness, highest first.

class RankingDebugger(DiscreteSpectrumDebugger):
"""Rank events by their suspiciousness"""

def rank(self) -> List[Any]:
"""Return a list of events, sorted by suspiciousness, highest first."""

def susp(event: Any) -> float:
suspiciousness = self.suspiciousness(event)
assert suspiciousness is not None
return suspiciousness

events = list(self.all_events())
events.sort(key=susp, reverse=True)
return events

def __repr__(self) -> str:
return repr(self.rank())


### The Tarantula Metric¶

We can use the Tarantula metric to sort lines according to their suspiciousness. The "redder" a line (a hue of 0.0), the more suspicious it is. We can simply define

$$\textit{suspiciousness}_\textit{tarantula}(\textit{event}) = 1 - \textit{color hue}(\textit{event})$$

where $\textit{color hue}$ is as defined above.

This is exactly the suspiciousness() function as already implemented in our ContinuousSpectrumDebugger.

We introduce the TarantulaDebugger class, inheriting visualization capabilities from the ContinuousSpectrumDebugger class as well as the suspiciousness features from the RankingDebugger class.

class TarantulaDebugger(ContinuousSpectrumDebugger, RankingDebugger):
"""Spectrum-based Debugger using the Tarantula metric for suspiciousness"""
pass


Let us list remove_html_markup() with highlighted lines again:

tarantula_html = test_debugger_html(TarantulaDebugger())

tarantula_html

   1 def remove_html_markup(s):  # type: ignore
   2     tag = False
   3     quote = False
   4     out = ""
   5
   6     for c in s:
   7         if c == '<' and not quote:
   8             tag = True
   9         elif c == '>' and not quote:
  10             tag = False
  11         elif c == '"' or c == "'" and tag:
  12             quote = not quote
  13         elif not tag:
  14             out = out + c
  15
  16     return out

Here's our ranking of lines, from most suspicious to least suspicious:

tarantula_html.rank()

[('remove_html_markup', 12),
('remove_html_markup', 3),
('remove_html_markup', 1),
('remove_html_markup', 13),
('remove_html_markup', 6),
('remove_html_markup', 11),
('remove_html_markup', 16),
('remove_html_markup', 4),
('remove_html_markup', 9),
('remove_html_markup', 2),
('remove_html_markup', 14),
('remove_html_markup', 7),
('remove_html_markup', 10),
('remove_html_markup', 8)]

tarantula_html.suspiciousness(tarantula_html.rank()[0])

1.0


We see that the first line in the list is indeed the most suspicious; the two "green" lines come at the very end.

For the middle() function, we also obtain a ranking from "reddest" to "greenest".

tarantula_middle = test_debugger_middle(TarantulaDebugger())

tarantula_middle

   1 def middle(x, y, z):  # type: ignore
   2     if y < z:
   3         if x < y:
   4             return y
   5         elif x < z:
   6             return y
   7     else:

   8         if x > y:
   9             return y
  10         elif x > z:
  11             return x

  12     return z
tarantula_middle.rank()

[('middle', 6),
('middle', 5),
('middle', 3),
('middle', 1),
('middle', 2),
('middle', 12),
('middle', 10),
('middle', 4),
('middle', 9),
('middle', 8)]

tarantula_middle.suspiciousness(tarantula_middle.rank()[0])

0.8333333333333333


### The Ochiai Metric¶

The Ochiai Metric [Akira Ochiai, 1957] first introduced in the biology domain [da Silva Meyer et al, 2004] and later applied for fault localization by Abreu et al. [Abreu et al, 2009], is defined as follows:

$$\textit{suspiciousness}_\textit{ochiai} = \frac {\textit{failed}(\textit{event})} {\sqrt{ \bigl(\textit{failed}(\textit{event}) + \textit{not-in-failed}(\textit{event})\bigr) \times \bigl(\textit{failed}(\textit{event}) + \textit{passed}(\textit{event})\bigr) }}$$

where

• $\textit{failed}(\textit{event})$ is the number of times the event occurred in failing runs
• $\textit{not-in-failed}(\textit{event})$ is the number of times the event did not occur in failing runs
• $\textit{passed}(\textit{event})$ is the number of times the event occurred in passing runs.

We can easily implement this formula:

import math

class OchiaiDebugger(ContinuousSpectrumDebugger, RankingDebugger):
"""Spectrum-based Debugger using the Ochiai metric for suspiciousness"""

def suspiciousness(self, event: Any) -> Optional[float]:
failed = len(self.collectors_with_event(event, self.FAIL))
not_in_failed = len(self.collectors_without_event(event, self.FAIL))
passed = len(self.collectors_with_event(event, self.PASS))

try:
return failed / math.sqrt((failed + not_in_failed) * (failed + passed))
except ZeroDivisionError:
return None

def hue(self, event: Any) -> Optional[float]:
suspiciousness = self.suspiciousness(event)
if suspiciousness is None:
return None
return 1 - suspiciousness


Applied on the remove_html_markup() function, the individual suspiciousness scores differ from Tarantula. However, we obtain a very similar visualization, and the same ranking.

ochiai_html = test_debugger_html(OchiaiDebugger())

ochiai_html

   1 def remove_html_markup(s):  # type: ignore
   2     tag = False
   3     quote = False
   4     out = ""
   5
   6     for c in s:
   7         if c == '<' and not quote:
   8             tag = True
   9         elif c == '>' and not quote:
  10             tag = False
  11         elif c == '"' or c == "'" and tag:
  12             quote = not quote
  13         elif not tag:
  14             out = out + c
  15
  16     return out
ochiai_html.rank()

[('remove_html_markup', 12),
('remove_html_markup', 3),
('remove_html_markup', 1),
('remove_html_markup', 13),
('remove_html_markup', 6),
('remove_html_markup', 11),
('remove_html_markup', 16),
('remove_html_markup', 4),
('remove_html_markup', 9),
('remove_html_markup', 2),
('remove_html_markup', 14),
('remove_html_markup', 7),
('remove_html_markup', 10),
('remove_html_markup', 8)]

ochiai_html.suspiciousness(ochiai_html.rank()[0])

1.0


The same observations also apply for the middle() function.

ochiai_middle = test_debugger_middle(OchiaiDebugger())

ochiai_middle

   1 def middle(x, y, z):  # type: ignore
   2     if y < z:
   3         if x < y:
   4             return y
   5         elif x < z:
   6             return y
   7     else:

   8         if x > y:
   9             return y
  10         elif x > z:
  11             return x

  12     return z
ochiai_middle.rank()

[('middle', 6),
('middle', 5),
('middle', 3),
('middle', 1),
('middle', 2),
('middle', 12),
('middle', 10),
('middle', 4),
('middle', 9),
('middle', 8)]

ochiai_middle.suspiciousness(ochiai_middle.rank()[0])

0.7071067811865475


### How Useful is Ranking?¶

So, which metric is better? The standard method to evaluate such rankings is to determine a ground truth – that is, the set of locations that eventually are fixed – and to check at which point in the ranking any such location occurs – the earlier, the better. In our remove_html_markup() and middle() examples, both the Tarantula and the Ochiai metric perform flawlessly, as the "culprit" line is always ranked at the top. However, this need not always be the case; the exact performance depends on the nature of the code and the observed runs. (Also, the question of whether there always is exactly one possible location where the program can be fixed is open for discussion.)

You will be surprised that over time, several dozen metrics have been proposed [Wong et al, 2016], each performing somewhat better or somewhat worse depending on which benchmark they were applied on. The two metrics discussed above each have their merits – the Tarantula metric was among the first such metrics, and the Ochiai metric is generally shown to be among the most effective ones [Abreu et al, 2009].

While rankings can be easily evaluated, it is not necessarily clear whether and how much they serve programmers. As stated above, the assumption of rankings is that developers examine one potentially defective statement after another until they find the actually defective one. However, in a series of human studies with developers, Parnin and Orso [Parnin et al, 2011] found that this assumption may not hold:

It is unclear whether developers can actually determine the faulty nature of a statement by simply looking at it, without any additional information (e.g., the state of the program when the statement was executed or the statements that were executed before or after that one).

In their study, they found that rankings could help completing a task faster, but this effect was limited to experienced developers and simpler code. Artificially changing the rank of faulty statements had little to no effect, implying that developers would not strictly follow the ranked list of statements, but rather search through the code to understand it. At this point, a visualization as in the Tarantula tool can be helpful to programmers as it guides the search, but a ranking that defines where to search may be less useful.

Having said that, ranking has its merits – notably as it comes to informing automated debugging techniques. In the chapter on program repair, we will see how ranked lists of potentially faulty statements tell automated repair techniques where to try to repair the program first. And once such a repair is successful, we have a very strong indication on where and how the program could be fixed!

## Using Large Test Suites¶

In fault localization, the larger and the more thorough the test suite, the higher the precision. Let us try out what happens if we extend the middle() test suite with additional test cases.

The function middle_testcase() returns a random input for middle():

import random

def middle_testcase() -> Tuple[int, int, int]:
x = random.randrange(10)
y = random.randrange(10)
z = random.randrange(10)
return x, y, z

[middle_testcase() for i in range(5)]

[(1, 9, 9), (0, 9, 2), (9, 0, 4), (7, 2, 1), (6, 1, 5)]


The function middle_test() simply checks if middle() operates correctly – by placing x, y, and z in a list, sorting it, and checking the middle argument. If middle() fails, middle_test() raises an exception.

def middle_test(x: int, y: int, z: int) -> None:
m = middle(x, y, z)
assert m == sorted([x, y, z])[1]

middle_test(4, 5, 6)

from ExpectError import ExpectError

with ExpectError():
middle_test(2, 1, 3)

Traceback (most recent call last):
File "<ipython-input-1-ae2957225406>", line 2, in <module>
middle_test(2, 1, 3)
File "<ipython-input-1-e1407680b9f2>", line 3, in middle_test
assert m == sorted([x, y, z])[1]
AssertionError (expected)


The function middle_passing_testcase() searches and returns a triple x, y, z that causes middle_test() to pass.

def middle_passing_testcase() -> Tuple[int, int, int]:
while True:
try:
x, y, z = middle_testcase()
middle_test(x, y, z)
return x, y, z
except AssertionError:
pass

(x, y, z) = middle_passing_testcase()
m = middle(x, y, z)
print(f"middle({x}, {y}, {z}) = {m}")

middle(2, 6, 7) = 6


The function middle_failing_testcase() does the same; but its triple x, y, z causes middle_test() to fail.

def middle_failing_testcase() -> Tuple[int, int, int]:
while True:
try:
x, y, z = middle_testcase()
middle_test(x, y, z)
except AssertionError:
return x, y, z

(x, y, z) = middle_failing_testcase()
m = middle(x, y, z)
print(f"middle({x}, {y}, {z}) = {m}")

middle(5, 4, 6) = 4


With these, we can define two sets of test cases, each with 100 inputs.

MIDDLE_TESTS = 100

MIDDLE_PASSING_TESTCASES = [middle_passing_testcase()
for i in range(MIDDLE_TESTS)]

MIDDLE_FAILING_TESTCASES = [middle_failing_testcase()
for i in range(MIDDLE_TESTS)]


Let us run the OchiaiDebugger with these two test sets.

ochiai_middle = OchiaiDebugger()

for x, y, z in MIDDLE_PASSING_TESTCASES:
with ochiai_middle.collect_pass():
middle(x, y, z)

for x, y, z in MIDDLE_FAILING_TESTCASES:
with ochiai_middle.collect_fail():
middle(x, y, z)

ochiai_middle

   1 def middle(x, y, z):  # type: ignore
   2     if y < z:
   3         if x < y:
   4             return y
   5         elif x < z:
   6             return y
   7     else:

   8         if x > y:
   9             return y
  10         elif x > z:
  11             return x
  12     return z

We see that the "culprit" line is still the most likely to be fixed, but the two conditions leading to the error (x < y and x < z) are also listed as potentially faulty. That is because the error might also be fixed be changing these conditions – although this would result in a more complex fix.

## Other Events besides Coverage¶

We close this chapter with two directions for further thought. If you wondered why in the above code, we were mostly talking about events rather than lines covered, that is because our framework allows for tracking arbitrary events, not just coverage. In fact, any data item a collector can extract from the execution can be used for correlation analysis. (It may not be so easily visualized, though.)

Here's an example. We define a ValueCollector class that collects pairs of (local) variables and their values during execution. Its events() method then returns the set of all these pairs.

class ValueCollector(Collector):
""""A class to collect local variables and their values."""

def __init__(self) -> None:
"""Constructor."""
super().__init__()
self.vars: Set[str] = set()

def collect(self, frame: FrameType, event: str, arg: Any) -> None:
local_vars = frame.f_locals
for var in local_vars:
value = local_vars[var]

def events(self) -> Set[str]:
"""A set of (variable, value) pairs observed"""
return self.vars


If we apply this collector on our set of HTML test cases, these are all the events that we obtain – essentially all variables and all values ever seen:

debugger = test_debugger_html(ContinuousSpectrumDebugger(ValueCollector))
for event in debugger.all_events():
print(event)

s = 'abc'
c = '"'
c = 'a'
s = '"abc"'
c = 'c'
c = '>'
tag = False
c = 'b'
out = 'abc'
quote = True
out = ''
s = '<b>abc</b>'
tag = True
quote = False
c = '<'
out = 'a'
out = 'ab'
c = '/'


However, some of these events only occur in the failing run:

for event in debugger.only_fail_events():
print(event)

c = '"'
s = '"abc"'
quote = True


Some of these differences are spurious – the string "abc" (with quotes) only occurs in the failing run – but others, such as quote being True and c containing a single quote are actually relevant for explaining when the failure comes to be.

We can even visualize the suspiciousness of the individual events, setting the (so far undiscussed) color flag for producing an event table:

debugger.event_table(color=True, args=True)

remove_html_markup s='abc' s='<b>abc</b>' s='"abc"'
c = '"' - - X
c = '/' - X -
c = '<' - X -
c = '>' - X -
c = 'a' X X X
c = 'b' X X X
c = 'c' X X X
out = '' X X X
out = 'a' X X X
out = 'ab' X X X
out = 'abc' X X X
quote = False X X X
quote = True - - X
s = '"abc"' - - X
s = '<b>abc</b>' - X -
s = 'abc' X - -
tag = False X X X
tag = True - X -

There are many ways one can continue from here.

• Rather than checking for concrete values, one could check for more abstract properties, for instance – what is the sign of the value? What is the length of the string?
• One could check for specifics of the control flow – is the loop taken? How many times?
• One could check for specifics of the information flow – which values flow from one variable to another?

There are lots of properties that all could be related to failures – and if we happen to check for the right one, we may obtain a much crisper definition of what causes the failure. We will come up with more ideas on properties to check as it comes to mining specifications.

## Training Classifiers¶

The metrics we have discussed so far are pretty generic – that is, they are fixed no matter how the actual event space is structured. The field of machine learning has come up with techniques that learn classifiers from a given set of data – classifiers that are trained from labeled data and then can predict labels for new data sets. In our case, the labels are test outcomes (PASS and FAIL), whereas the data would be features of the events observed.

A classifier by itself is not immediately useful for debugging (although it could predict whether future inputs will fail or not). Some classifiers, however, have great diagnostic quality; that is, they can explain how their classification comes to be. Decision trees fall into this very category.

A decision tree contains a number of nodes, each one associated with a predicate. Depending on whether the predicate is true or false, we follow the given "true" or "false" branch to end up in the next node, which again contains a predicate. Eventually, we end up in the outcome predicted by the tree. The neat thing is that the node predicates actually give important hints on the circumstances that are most relevant for deciding the outcome.

Let us illustrate this with an example. We build a class ClassifyingDebugger that trains a decision tree from the events collected. To this end, we need to set up our input data such that it can be fed into a classifier.

We start with identifying our samples (runs) and the respective labels (outcomes). All values have to be encoded into numerical values.

class ClassifyingDebugger(DifferenceDebugger):
"""A debugger implementing a decision tree for events"""

PASS_VALUE = +1.0
FAIL_VALUE = -1.0

def samples(self) -> Dict[str, float]:
samples = {}
for collector in self.pass_collectors():
samples[collector.id()] = self.PASS_VALUE
for collector in debugger.fail_collectors():
samples[collector.id()] = self.FAIL_VALUE
return samples

debugger = test_debugger_html(ClassifyingDebugger())
debugger.samples()

{"remove_html_markup(s='abc')": 1.0,
"remove_html_markup(s='<b>abc</b>')": 1.0,
'remove_html_markup(s=\'"abc"\')': -1.0}


Next, we identify the features, which in our case is the set of lines executed in each sample:

class ClassifyingDebugger(ClassifyingDebugger):
def features(self) -> Dict[str, Any]:
features = {}
for collector in debugger.pass_collectors():
features[collector.id()] = collector.events()
for collector in debugger.fail_collectors():
features[collector.id()] = collector.events()
return features

debugger = test_debugger_html(ClassifyingDebugger())
debugger.features()

{"remove_html_markup(s='abc')": {('remove_html_markup', 1),
('remove_html_markup', 2),
('remove_html_markup', 3),
('remove_html_markup', 4),
('remove_html_markup', 6),
('remove_html_markup', 7),
('remove_html_markup', 9),
('remove_html_markup', 11),
('remove_html_markup', 13),
('remove_html_markup', 14),
('remove_html_markup', 16)},
"remove_html_markup(s='<b>abc</b>')": {('remove_html_markup', 1),
('remove_html_markup', 2),
('remove_html_markup', 3),
('remove_html_markup', 4),
('remove_html_markup', 6),
('remove_html_markup', 7),
('remove_html_markup', 8),
('remove_html_markup', 9),
('remove_html_markup', 10),
('remove_html_markup', 11),
('remove_html_markup', 13),
('remove_html_markup', 14),
('remove_html_markup', 16)},
'remove_html_markup(s=\'"abc"\')': {('remove_html_markup', 1),
('remove_html_markup', 2),
('remove_html_markup', 3),
('remove_html_markup', 4),
('remove_html_markup', 6),
('remove_html_markup', 7),
('remove_html_markup', 9),
('remove_html_markup', 11),
('remove_html_markup', 12),
('remove_html_markup', 13),
('remove_html_markup', 14),
('remove_html_markup', 16)}}


All our features have names, which must be strings.

class ClassifyingDebugger(ClassifyingDebugger):
def feature_names(self) -> List[str]:
return [repr(feature) for feature in self.all_events()]

debugger = test_debugger_html(ClassifyingDebugger())
debugger.feature_names()

["('remove_html_markup', 10)",
"('remove_html_markup', 3)",
"('remove_html_markup', 8)",
"('remove_html_markup', 1)",
"('remove_html_markup', 13)",
"('remove_html_markup', 6)",
"('remove_html_markup', 11)",
"('remove_html_markup', 16)",
"('remove_html_markup', 4)",
"('remove_html_markup', 9)",
"('remove_html_markup', 2)",
"('remove_html_markup', 14)",
"('remove_html_markup', 7)",
"('remove_html_markup', 12)"]


Next, we define the shape for an individual sample, which is a value of +1 or -1 for each feature seen (i.e., +1 if the line was covered, -1 if not).

class ClassifyingDebugger(ClassifyingDebugger):
def shape(self, sample: str) -> List[float]:
x = []
features = self.features()
for f in self.all_events():
if f in features[sample]:
x += [+1.0]
else:
x += [-1.0]
return x

debugger = test_debugger_html(ClassifyingDebugger())
debugger.shape("remove_html_markup(s='abc')")

[-1.0, 1.0, -1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, -1.0]


Our input X for the classifier now is a list of such shapes, one for each sample.

class ClassifyingDebugger(ClassifyingDebugger):
def X(self) -> List[List[float]]:
X = []
samples = self.samples()
for key in samples:
X += [self.shape(key)]
return X

debugger = test_debugger_html(ClassifyingDebugger())
debugger.X()

[[-1.0, 1.0, -1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, -1.0],
[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, -1.0],
[-1.0, 1.0, -1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]]


Our input Y for the classifier, in contrast, is the list of labels, again indexed by sample.

class ClassifyingDebugger(ClassifyingDebugger):
def Y(self) -> List[float]:
Y = []
samples = self.samples()
for key in samples:
Y += [samples[key]]
return Y

debugger = test_debugger_html(ClassifyingDebugger())
debugger.Y()

[1.0, 1.0, -1.0]


We now have all our data ready to be fit into a tree classifier. The method classifier() creates and returns the (tree) classifier for the observed runs.

from sklearn.tree import DecisionTreeClassifier, export_text, export_graphviz

class ClassifyingDebugger(ClassifyingDebugger):
def classifier(self) -> DecisionTreeClassifier:
classifier = DecisionTreeClassifier()
classifier = classifier.fit(self.X(), self.Y())
return classifier


We define a special method to show classifiers:

import graphviz

class ClassifyingDebugger(ClassifyingDebugger):
def show_classifier(self, classifier: DecisionTreeClassifier) -> Any:
dot_data = export_graphviz(classifier, out_file=None,
filled=False, rounded=True,
feature_names=self.feature_names(),
class_names=["FAIL", "PASS"],
label='none',
node_ids=False,
impurity=False,
proportion=True,
special_characters=True)

return graphviz.Source(dot_data)


This is the tree we get for our remove_html_markup() tests. The top predicate is whether the "culprit" line was executed (-1 means no, +1 means yes). If not (-1), the outcome is PASS. Otherwise, the outcome is TRUE.

debugger = test_debugger_html(ClassifyingDebugger())
classifier = debugger.classifier()
debugger.show_classifier(classifier)


We can even use our classifier to predict the outcome of additional runs. If, for instance, we execute all lines except for, say, Line 7, 9, and 11, our tree classifier would predict failure – because the "culprit" line 12 is executed.

classifier.predict([[1, 1, 1, 1, 1, 1, -1, 1, -1, 1, -1, 1, 1, 1]])

array([-1.])


Again, there are many ways to continue from here. Which events should we train the classifier from? How do classifiers compare in their performance and diagnostic quality? There are lots of possibilities left to explore, and we only begin to realize the potential for automated debugging.

## Lessons Learned¶

• Correlations between execution events and outcomes (pass/fail) can make important hints for debugging
• Events occurring only (or mostly) during failing runs can be highlighted and ranked to guide the search
• Important hints include whether the execution of specific code locations correlates with failure

## Next Steps¶

Chapters that build on this one include

## Background¶

The seminal works on statistical debugging are two papers:

• "Visualization of Test Information to Assist Fault Localization" [Jones et al, 2002] by James Jones, Mary Jean Harrold, and John Stasko introducing Tarantula and its visualization. The paper won an ACM SIGSOFT 10-year impact award.
• "Bug Isolation via Remote Program Sampling" [Liblit et al, 2003] by Ben Liblit, Alex Aiken, Alice X. Zheng, and Michael I. Jordan, introducing the term "Statistical debugging". Liblit won the ACM Doctoral Dissertation Award for this work.

The Ochiai metric for fault localization was introduced by [Abreu et al, 2009]. The overview by Wong et al. [Wong et al, 2016] gives a comprehensive overview on the field of statistical fault localization.

The study by Parnin and Orso [Parnin et al, 2011] is a must to understand the limitations of the technique.

## Exercises¶

### Exercise 1: A Postcondition for Middle¶

What would be a postcondition for middle()? How can you check it?

### Exercise 2: Statistical Dependencies¶

Using the dependencies from the chapter on slicing, can you determine which specific data or control dependencies correlate with failure?

### Exercise 3: Correlating with Conditions¶

In HTML, it is permissible that tag attribute values can also have single quotes, as in

<b class='extrabold'>abc</b>


Such attributes are actually treated correctly by our remove_html_markup() code.

#### Part 1: Experiment¶

What happens if test inputs with single quote attributes become part of our test set? How does statistical fault localization change?

#### Part 2: Collecting Conditions¶

Instead of aiming for lines that correlate with failures, we can look at individual branch conditions such as c == "'", c == '"', or tag that correlate with failures. In the above case, the condition c == "'" would correlate, whereas c == '"' would not.

Reusing the code instrumentation from the chapter on slicing, collect the individual values of Boolean conditions in tests during execution in a ConditionCollector class. Show the event table.