Automatic Testing: Battleships Anyone?

In the last article, we made the case for augmenting testing with automatic algorithms to complement human effort. We explored what automatic methods could do, looking only at the source code and looked at the problems they could solve for us. This article picks up where we left off, and explains how we can analyse programs and try to find inputs to achieve code coverage automatically.

A Simple Programming Language

To start with, we'll use the simplistic programming language from the previous article that demonstrates how we can analyse a program to transform the activity of generating inputs into a search problem. Our language has the familiar constructs: branching (if statements), relational operators (equality, greater than, less than), and a return statement. The language has variables too, the values of which might be set elsewhere (but can't change).

			Please enable JavaScript to view the examples.
Figure 1: An example program

The example has a small amount of behaviour. We control three inputs, a, b, and c and one of three things will happen depending on that input. If the three inputs are the same, we will reach the first return statement. Otherwise, we will reach one of the two statements depending on the value of the inputs.

This example has a few things wrong with it. Firstly, the implementation is wrong: triangles are misclassified. There also aren't enough checks to make sure that the side lengths make a triangle. However, we can't see these faults by looking at the code in isolation. We only know about the faults because we know what a triangle is.

If the code can't tell us if it's broken, what does it tell us? We can take two different 'views' of code. The first perspective is 'static'; we run a tool that analyses the contents of the source file and looks for problems. Static analysis tools such as PVS-Studio and Coverity can uncover potential bugs using detection rules and data flow analyses. The other perspective is 'dynamic'. Instead of reading the source code, we watch what the program does, analysing it 'dynamically'. These approaches are complementary: they are both adept at finding particular flaws.

Dynamic analysis is well suited to problems where the source code can't be analysed, or where the results can't be easily modelled (more on models later). In this example, we'll use dynamic analysis to cover 100% of the decision points in the program. We can enumerate the paths through the program statically by tracing each branch. While it's straight forward to get the list of paths, generating inputs that follow those paths is more difficult.

			Please enable JavaScript to view the examples.
Figure 2: The path to the goal
To begin, suppose we want to reach the statement . In order to do that, we'll need to find a path through the program that allows us to reach that statement. We can compute the path by collecting the branches taken to reach the target, shown in the box on the left.

At this stage, we have a target, and we know the path we need to take to reach it. However, at each decision point in the path, there is a test we must satisfy, that our input values allow us to advance. This is where search and computing power come into the fold.

Finding Inputs

At each step along the path, there is a condition that we must solve. We can't simply try to solve each in turn, as values that satisfy one condition might conflict with later conditions. There are two key ways we can try to solve these conditions.


We can take the condition for the branch and analyse it symbolically. This works by taking each of the conditions from the branches, formulating it as a mathematical expression, and reasoning about it using, for example, an SMT solver like Z3. This is easy for simple, linear constraints, such as a == b, but becomes more difficult as the math gets harder and the constraints get larger. Pex (used in Visual Studio 2015's Smart Unit Tests) uses this approach and relies on a comprehensive set of "models" for various arithmetic, string, and IO operations that allow it to generate test cases for more complicated programs.


Instead of modelling the conditions and solving the constraint analytically, we can take a "scientific" approach. We try inputs, and use a modified program that replies with information on how close we were to the branch we targeted. Armed with that information, and information about previous attempts, we can try new values that might be more likely to reach the target. This approach is simpler to implement, and has the advantage that it only requires a minimum of understanding about the code to generate input data. Evosuite, and to an extent, fuzzers such as AFL, use this approach.

The program needs to tell us how close each input we try is to our target. In the example, we can easily tell if we reached a point on the target path by stepping through the program. However, if we miss the goal, we don't have the feedback we need to make the next guess. Unless we do something more intelligent, we can only make random guesses.

We get the feedback we need by modifying the program so that we measure the proximity of our inputs based on the difference between the input and the value that satisfies each constraint. When this is maximised, we've reached our target. We've gone from playing battleship ("hit" versus "miss") to playing Marco Polo (the feedback tells us where the target is compared to where we are).

In the case of our program, we can transform constraints into functions of the input variables, then search for inputs that maximise these functions. However, we must be careful that our transformation of values produces the right shape and gives us that proximity measure.

Figure 3: Detailed and undetailed feedback (Click here to switch feedback)
The example shows two encodings of the constraint x == 10. If you toggle the function, you can see the difference between the two. One produces the maximum value (100) only when x is 10, whereas the other takes the absolute value of 10 - x and subtracts it from 100, giving a slope towards the best value. Sloped shapes are ideal for search algorithms, as they give them a clue where to try next. In contrast, a spike at one value is hard to find without relying on luck.


Hitting Targets

Figure 4: Building up a picture with random guesses
In the search process we start with no information: we have to make a completely random guess. We can repeatedly make guesses across the input space to get an idea for where the values that reach our target are. The animation shows how this works. After a few random guesses, the "shape" becomes apparent. Instead of guessing randomly, we can use the shape to make a more informed choice. It's important that we don't have to understand the process that generates this shape; we can still reach the target without understanding the relationship between inputs and outputs.

There are multiple algorithms we can use to try to find these inputs. In our example, we only have one input which simplifies the problem a great deal. To find the maximum, there are a few strategies we can employ. The simplest is a random search, as used in the example above. If we wait long enough, we'll probably find the answer. Instead of relying on serendipitously finding the answer, we can be more methodical.

Figure 5: A simple search algorithm (switch feedback)
The example in Figure 5 shows how the optimum can be reached. We start from a random point, then move in the direction that produces the highest increase in fitness. If you toggle the feedback function, you can see how a more intelligent approach is able to find the maximum value by moving in whichever direction that yields an increase from the previous guess.

As the naive function is flat with the exception of the one good value, the search will give up (unless it randomly chooses it as its starting point). In contrast, using the absolute value of the distance gives a slope that is easy to traverse.

Covering Branches

Now we have a way to extract the constraints from each branch, plus an approach to represent constraints to run a searching algorithm. The final step is to combine them. In cases where we have multiple branches, we want to find values that satisfy both equations simultaneously. In a search-based context this is simple: we just aggregate the individual functions.

In practice, more advanced searching algorithms are used that reduce the number of inputs tried. There are a variety of approaches that can be used to minimise the time wasted exploring useless inputs, such as pattern search, which increases the stride it takes as long as fitness values increase. In principle, however, these algorithms are very similar to one another; at the very least they share the same goal.


This article showed how program analysis allows us to establish targets and generate feedback we can use to search for inputs. The next article in this series will expand on search algorithms, showing how the approach described here can be taken into practice by generating tests for our program with a more sophisticated search algorithm.

Read More

National Software Testing Conference 19th-20th May 2015

Location: The British Museum
Date: 19th-20th May
Registration: National Software Testing Conference

On the 19th-20th May the ACRC will be exhibiting alongside Broker@Cloud an FP7 EU project looking to create a framework of quality assurance and optimisation for cloud environments.

Being held for the second year, located at the British Museum in central London some of the most experienced minds in the industry will be presenting and demonstrating, “Real-world’ and case studies from organisations crowned at The European Software Testing Awards. The National Software Testing Conference 2015 targets those that are serious about the process and delivery of software testing and quality. Last year alone saw the attendance of Head of Test Delivery & Functionality, Lloyds Bank Plc, Head of Test Design and Consultancy, Home Office, Head of Testing, Royal Sun Alliance to name a few.

The ACRC stand will host our testing experts Dr Phil McMinn, and Dr Mathew Hall along with Broker@Cloud’s Dr Anthony Simons. So come down and say hello to us and take a look at our demonstrations.

Read More

Automatic Software Testing: Rise of the Machines?

Automatic software testing has been a long-standing dream of many industrial practitioners and researchers alike. However, the state of practice is behind these advances in most industries. This series of articles will delve into the latest research, showing where automatic search-based techniques can fit into existing testing processes. First, we’ll explain the motivation for making computers test (or check) software, and show where they can be complementary to human testers.


Software testing is crucial. There is a fruitful supply of examples of failures caused by software that could have been prevented with better testing procedures. One of the canonical examples used in Software Engineering classes is Ariane 5’s maiden voyage. The component that triggered the failure was reused from Ariane 4, but was not tested for Ariane 5’s different flight path. Simulations were found to demonstrate the failure, unfortunately only after the launch and the rocket’s subsequent self-destruction.

Testing isn’t easy. Failures can manifest in complex interactions between various components. One of the aims of software testing is to provide assurance that these interactions won’t cause failures when the system is deployed. As creative as humans are, the number of cases to consider makes the game of guessing which interactions might erroneously fail significantly challenging.

In complex systems, even after failure, the exercise of finger-pointing is often tricky. In his “Normal Accidents” book, Charles Perrow deconstructs many failures of complex safety critical systems to show, worryingly, that systems are highly likely to fail as a result of complexity.

This almost makes building systems seem like a futile task, however, the show must go on, so to speak. Most of our actions aren’t risk free, but we still have to have some appetite for risk. Testing serves to reduce risks by building confidence that the product behaves as it should, and identifying defects when they are least troublesome (i.e. before the customer has seen them).

Unfortunately, budgets constrain the amount of testing that can go into managing a project’s risk. As a result, testing must be delicately prioritised to deliver the best risk reduction possible. This might involve forgoing, or reducing the depth of, tests for less critical components to ensure that significant failures are protected against, and that common functionality is available.

Creativity versus Methodology

Human creativity is a great way to look for bugs. Given enough time, a human has a chance of finding all the failure modes of a piece of software. Unlike a human, however, computers aren’t creative and don’t get tired or bored. Part of testing involves tedious repetition of processes over multiple configurations. These are tasks for which computers are well-suited, hence the rise of automation in the software QA industry, with the aim of increasing the time testers can spend creatively trying to locate failures.

As useful as automation is, the main barrier to adoption stems from the need for a human to tell the computer what to do. Scripting is a technical skill that many testers don’t need, so it’s often unreasonable to ask them to don a developer’s hat and write some code; code that they’ll undoubtedly have to maintain.

A large body of software testing research is directed at a problem that requires no human input. As we discussed previously, it’s important for testing to cover as much of the behaviour of the product as possible to reduce the risk of flaws. Unlike automation, uncovering inputs to programs (at the unit level) can be made automatic.


In the academic software testing field, work is often evaluated on its ability to maximise “coverage”. When we talk about coverage, we often use it in specific terms, such as code coverage for test adequacy, but these are all proxies for an unattainable notion of the coverage of the behaviour of the program our test suite produces. In other words, it’s how confident we can be that the system is unlikely to exhibit faulty behaviour, because our test suite has (should) already allowed us to observe it.

With an appropriate method of measuring the coverage of a test suite, we can build a feedback loop in which an algorithm constructs some tests (checks), then tweaks them in response to the coverage they attain. This loop is fundamental to a lot of automated systematic testing tools, such as EvoSuite, KLEE, and Code Digger (a.k.a Pex) to name but a few.

Unlike humans, it’s feasible to ask machines to spend hours trying different inputs to reach different parts of the code. Combined with code analysis, testing tools can automatically find inputs that exercise the hard-to-reach corner cases in the program for a small investment in compute time.

if (a==b){
if (a==c){ return 'Scalene' }else{ return 'Isosceles' }
}else{ return 'Equilateral' }

To illustrate that difficulty, there are a couple of bugs in the example. The code should, if it were correct, classify a triangle given three of its sides, in the variables a, b, and c. The first bug is fairly easy to spot; a triangle with three equal sides (an equilateral triangle) is incorrectly classified as scalene. The second is caused by missing code; impossible triangles (such as 1, 2, 3) are not distinguished between physically possible triangles.

Finding Faults

Revisiting the notion of coverage, we can analyse the example above to estimate how good a given test set is. There are three logically different cases, so if we have a test case for each we can be sure we’ve tested all the behaviour of the program. This is the ideal application of computation. If these constraints were much more complicated, it would be unrealistic for a human to come up with every possible interaction. A list of them identified by automated testing would be a useful shortcut. The tester could then check each of the cases and decide if the outputs are right for the given inputs.

Going back to the bugs, however, there’s a bug that doesn’t arise because of the conditions already in the program. Impossible triangles (such as a straight line with sides 1,2,3) aren’t rejected. Even with total coverage of the branches in the program, without knowing what a triangle is, a computer is unlikely to come up with a test for it. This (often creative) application of knowledge and ability to ask questions in the form, “what should the program do?” instead of simply, “what does this program do?” are the things that are hard to automate. This is where the complementary relationship between human and computer arises for software testing.

Having dealt with the “whys” of using automated techniques, the next article will start to answer the “hows”, explaining how we can translate supplying input to programs into a problem a computer can try to solve.

Read More