
Kanban Board Enhancement Proposal

January 31, 2014

Too much information kills information. — Proverb

The Kanban board was originally used to visualize the workflow of a production process, but it can also be applied to knowledge work such as software development. A simple board can be implemented with sticky notes on a wall, or in software with tools like Kanbanery.

The objective of this post is to propose enhancements for the Kanban board that could potentially improve the user experience. The enhancements result from analyzing the properties of the board using principles from the Semiology of Graphics.

Graphical Analysis

[Figure: a Kanban board with Open, Active and Closed status columns containing Items 1 to 4]

The information in a Kanban board consists of the work items assigned to a team for the duration of a sprint (or iteration). This complete and unchanging definition common to all data points is called the invariant. Each work item must at least specify a summary and status, but it can also specify a priority, owner, estimate, tag, etc. In turn, these variations between items are called the variables (or components).

A variable has a level of organization that is fundamental for reordering, comparing and grouping information. The level can be characterized as:

  • Qualitative: Like the owner of a work item, it doesn’t have a universal order so it can be reordered arbitrarily for information processing purposes.
  • Ordered: Like the status, it has a single order from open to closed to convey the progression of a work item.
  • Quantitative: Like the estimate, it is possible to say that an item is estimated to take twice as long as another item.

The board is a two-dimensional plane containing a few geometric elements, each representing some information:

  • A point indicates a position on the plane with no theoretical length or area; it can be the size of an avatar to represent the owner of a work item.
  • A line has a start and end position with a measurable length but no area; it is primarily used to evaluate the number of work items under a status column.
  • An area is composed of points linked by lines which delimit a part of the plane; it is used to represent individual work items capable of containing more information.

The values assigned to a variable are represented with images that reflect different levels of classification. The separation of values, like the different owners, can be expressed by:

  • Color: Range of all color sensations, which not everyone perceives in the same way.
  • Orientation: Hatching of lines varying in orientation from horizontal to vertical.
  • Shape: Given the same surface, an image can have many different shapes.

The ordering of values, like some priorities being higher than others, can be expressed by:

  • Grain: Hatching of lines varying from fine to coarse.
  • Value: Intermediate colors between black and white.

The quantity of values, like estimates in hours, can be expressed by:

  • Size: Height of a column, surface of a line or a quantity of equal images.

Open Items

At the beginning of a sprint, work items are all open and appear under the left column of the Kanban board. This has the same effect as a degenerate binary tree where all the nodes form a linked list. For team members, as for computers, this makes it inefficient to choose the next item to work on.

An enhancement would be to move the open items to another board. Instead of using the status as columns, another variable could balance the items across multiple shorter columns. If the items are tagged, it might be more efficient to choose from a column with items of the same tag. The problem is that open items with a lower priority might be chosen from one column before other items with a higher priority in another column.

[Figure: an alternative board for open items with columns Tag 1, Tag 2 and Tag 3, divided into a High Priority area (Items 1 and 2) above a Low Priority area (Items 3 and 4)]

This could therefore be extended with areas ordered by priority spanning all tags, as illustrated above. The items from the highest priority area are clearly visible while keeping an overview of the remaining work items. Another problem is that fragmenting the board breaks the complete workflow which is core to understanding how the work proceeds.

The actual benefit depends on whether the gain of choosing work items more effectively outweighs the cost of breaking the workflow.

Closed Items

During the course of a sprint, more work items are closed and appear under the right column of the Kanban board. This gives the impression that the length of the column can be compared with the length of other columns to evaluate the amount of work completed so far. This might be misleading because the length is proportional to the number of work items rather than the hours of work.

An enhancement would be to change the representation of work items from point elements without a theoretical length to line elements with a length proportional to the estimate. By adding these elements under the status column, the length becomes proportional to the total hours of work. The problem is that the estimate for the closed items might be different from the actual time spent which might affect completing the sprint on target.

[Figure: an alternative board for closed items with columns for the relative difference between estimate and time spent (80%, 100% and 120%), containing Items 1 to 4]

This could be extended by moving the closed items to another board where the columns would be the relative difference between the estimate and the time spent, as illustrated above. The columns should look like an upside down normal distribution where skewness to the right indicates the sprint might not be completed on target. This suffers from the same problem as before of breaking the complete workflow.

Another enhancement might be to supplement the Kanban board with a burn down chart (or burn up chart). The benefit is that the chart is like a time series that displays the progress of the whole sprint whereas the Kanban board is more like a data map that only displays a snapshot of the current state.

Summary

This post proposed enhancements for a couple of areas in the Kanban board: open and closed work items. The enhancements sometimes solved problems only to introduce new ones. So, the benefit of their adoption actually depends on context.

The enhancements also attempted to extend the board with more variables which ultimately resulted in more boards. So, adding more variables doesn’t necessarily add more information; it might actually kill information.

Incremental Collaborative Filtering in SQL

November 13, 2010

Collaborative filtering is typically about getting recommendations from people with similar interests within a larger group. This process has been popularized by sites like amazon.com providing personalized recommendations based on purchase history.

This post proposes a rating-based implementation of collaborative filtering within a couple of constraints. First, the ratings must be calculated and persisted in an SQL database as opposed to residing in memory. Second, the recommendations must remain current at all times by maintaining ratings incrementally. This should be contrasted with other systems where recommendations are calculated periodically in a batch process.

History

In 1992, David Goldberg first used the term collaborative filtering in his paper Using collaborative filtering to weave an information tapestry. At the time, he implemented a system which enabled users to annotate documents as either interesting or uninteresting. These ratings were then used to calculate recommendations for the active user which falls under the category of user-based filtering.

In 2000, Vucetic and Obradovic proposed an alternative to address the inherent performance problems with user-based filtering in their paper A Regression-Based Approach for Scaling-Up Personalized Recommender Systems in E-Commerce. The most significant difference is taking an item-based approach by inferring interest of the current user from the relationships between each pair of items.

In 2005, Daniel Lemire and Anna Maclachlan proposed another alternative to address the computational complexity of existing predictor functions in their paper Slope One Predictors for Online Rating-Based Collaborative Filtering. The difference is that predictor functions typically use linear regression, which can suffer from a large number of regressors, whereas the Slope One predictor is sometimes faster while needing only half as many regressors.

Tables

To provide concrete examples, PostgreSQL will be used and these statements will create the initial tables to represent people and items:

CREATE TABLE person (id SERIAL NOT NULL PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE item (id SERIAL NOT NULL PRIMARY KEY, name TEXT NOT NULL);

Next, the rating table is created to represent the value of ratings collected by people across a range of items. The data type of the value column is defined as real to store single precision floating-point numbers but this could be changed to any numerical type:

CREATE TABLE rating (
    person_id INTEGER NOT NULL REFERENCES person,
    item_id INTEGER NOT NULL REFERENCES item,
    value REAL NOT NULL,
    PRIMARY KEY (person_id, item_id));

Finally, the frequency table is created to represent the popularity differential between each pair of items resulting in a square matrix. The size of the table can be reduced by observing that the lower triangular matrix is actually the negative transposition of the upper one resulting in a skew-symmetric matrix. The size can be further reduced by removing the main diagonal because the value in the sum column will always be zero. These optimizations can be enforced using a constraint to make sure that rows are only inserted in the lower triangular matrix:

CREATE TABLE frequency (
    item1_id INTEGER NOT NULL REFERENCES item,
    item2_id INTEGER NOT NULL REFERENCES item,
    count INTEGER NOT NULL DEFAULT 0,
    sum REAL NOT NULL DEFAULT 0.0,
    PRIMARY KEY (item1_id, item2_id),
    CONSTRAINT valid_item_id CHECK (item1_id > item2_id));

If the size of the table is still too large, it would also be possible to only insert rows in the matrix above a certain threshold resulting in a sparse matrix. The tradeoff is that the triggers to update the table become more complex because some entries might be missing. So, the right solution really depends on context.
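
As a rough sketch of that extra complexity (the helper below is hypothetical and not part of the schema above), the triggers could create missing rows lazily instead of relying on a fully pre-populated matrix:

-- Hypothetical helper for a sparse frequency table: update the pair if it
-- exists, otherwise create it on demand. Assumes i1 > i2 so the
-- valid_item_id constraint still holds.
CREATE OR REPLACE FUNCTION frequency_upsert(i1 INTEGER, i2 INTEGER, delta REAL) RETURNS VOID AS $$
BEGIN
    UPDATE frequency SET count = count + 1, sum = sum + delta
        WHERE item1_id = i1 AND item2_id = i2;
    IF NOT FOUND THEN
        INSERT INTO frequency (item1_id, item2_id, count, sum)
            VALUES (i1, i2, 1, delta);
    END IF;
END;
$$ LANGUAGE plpgsql;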

Triggers

Database triggers can be used to automatically maintain integrity of the frequency table when changes are made to the other tables. To define the corresponding procedures, PL/pgSQL will be used to continue expressing statements in SQL.

The first trigger consists of initializing the lower triangular matrix of the frequency table when a row is inserted in the item table. Even though this is expressed in a single statement, this results in i-1 rows where i is the current number of items which implies that the total size of this table is i(i-1)/2. So, this is where a sparse matrix might reduce storage and improve performance:

CREATE OR REPLACE FUNCTION item_insert() RETURNS TRIGGER AS $$
BEGIN
    INSERT INTO frequency (item1_id, item2_id) SELECT NEW.id, id FROM item WHERE id <> NEW.id;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER item_insert_trigger AFTER INSERT ON item
    FOR EACH ROW EXECUTE PROCEDURE item_insert();

The next triggers consist of maintaining the count and sum columns of the frequency table when changing the rating table. The complexity mostly resides in having to update each pair of items rated by the person in the new rating. Assuming the rows have already been inserted in the frequency table by the above trigger, this results in a single statement updating at most r-1 rows, where r is the number of ratings by that person:

CREATE OR REPLACE FUNCTION rating_insert() RETURNS TRIGGER AS $$
BEGIN
    UPDATE frequency SET count = count + 1, sum = sum + r2.value - r1.value
    FROM rating r1, rating r2
    WHERE r1.person_id = r2.person_id AND r2.person_id = NEW.person_id
    AND ((r1.item_id = NEW.item_id AND r1.item_id = item2_id AND r2.item_id = item1_id)
        OR (r2.item_id = NEW.item_id AND r2.item_id = item1_id AND r1.item_id = item2_id));
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER rating_insert_trigger AFTER INSERT ON rating
    FOR EACH ROW EXECUTE PROCEDURE rating_insert();

The corresponding triggers for UPDATE and DELETE events on the rating table are quite similar; the only difference is the content of the SET clause.
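
As a rough sketch (this is not the original code), the DELETE counterpart could look like the following; it reads the removed value directly from OLD rather than re-joining the rating table, because the deleted row is no longer visible to queries inside an AFTER DELETE trigger:

CREATE OR REPLACE FUNCTION rating_delete() RETURNS TRIGGER AS $$
BEGIN
    -- Subtract the contribution of the deleted rating from every pair formed
    -- with the other items rated by the same person.
    UPDATE frequency SET count = count - 1,
        sum = sum - CASE WHEN item1_id = OLD.item_id
                         THEN OLD.value - r.value
                         ELSE r.value - OLD.value END
    FROM rating r
    WHERE r.person_id = OLD.person_id AND r.item_id <> OLD.item_id
    AND ((item1_id = OLD.item_id AND item2_id = r.item_id)
        OR (item2_id = OLD.item_id AND item1_id = r.item_id));
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER rating_delete_trigger AFTER DELETE ON rating
    FOR EACH ROW EXECUTE PROCEDURE rating_delete();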

Recommendations

At this point, the frequency table only contains part of the matrix. Another part is the main diagonal which is actually superfluous because the value in the sum column will always be zero. The last part is the upper triangular matrix which can be computed as the negative transposition of the lower one:

CREATE VIEW frequency_matrix (item1_id, item2_id, count, sum) AS
    SELECT item1_id, item2_id, count, sum FROM frequency
    UNION SELECT item2_id, item1_id, count, -sum FROM frequency;

The resulting matrix can be used to provide non-personalized recommendations which don’t depend on a particular person. For example, the following statement returns an ordered list of related items for the given $item_id based on the ratings from all people:

SELECT item2_id AS item_id, (sum / count) AS average FROM frequency_matrix
    WHERE count > 0 AND item1_id = $item_id ORDER BY average DESC;

The same matrix can also be used to provide personalized recommendations for a given person. For example, the following statement returns an ordered list of the most relevant items for the given $person_id:

SELECT f.item1_id AS item_id, sum(f.sum + f.count * r1.value) / sum(f.count) AS average
    FROM frequency_matrix f JOIN rating r1 ON r1.item_id = f.item2_id
    LEFT JOIN rating r2 ON r2.item_id = f.item1_id AND r2.person_id = r1.person_id
    WHERE r1.person_id = $person_id AND f.count > 0 AND r2.item_id IS NULL
    GROUP BY f.item1_id ORDER BY average DESC;

Note that the above statement contains an anti-join query to exclude items already rated by the person which can be expressed with NOT IN, NOT EXISTS or LEFT JOIN. The latter operation is recommended because it usually results in a more efficient execution plan by the database optimizer.
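
For comparison, here is a sketch of the same exclusion expressed with NOT EXISTS instead of LEFT JOIN; it assumes the same $person_id parameter and should return the same rows:

SELECT f.item1_id AS item_id, sum(f.sum + f.count * r1.value) / sum(f.count) AS average
    FROM frequency_matrix f JOIN rating r1 ON r1.item_id = f.item2_id
    WHERE r1.person_id = $person_id AND f.count > 0
    AND NOT EXISTS (SELECT 1 FROM rating r2
        WHERE r2.item_id = f.item1_id AND r2.person_id = r1.person_id)
    GROUP BY f.item1_id ORDER BY average DESC;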

Summary

This post has provided an implementation of collaborative filtering in SQL which can be maintained incrementally. The most significant bottleneck seems to be the order of growth for the frequency table which might have both storage and performance implications. If this proves to be a problem, the table can be further optimized by using a sparse matrix.

Performance Testing in PostgreSQL

June 28, 2010

You can’t control what you can’t measure — Tom DeMarco

Performance testing in general measures how fast some aspect of a system performs under a particular workload. The scope of this post is limited to the execution of SQL statements in PostgreSQL.

In the end, the objective is to understand how to accurately test performance while mitigating side effects on both the client and server sides.

Test Plan

A test plan is a document detailing a systematic approach to testing the performance of a system. The most obvious part of the plan defines test cases which can be expressed as simply as SQL statements.

Another part of the plan defines the environment requirements which usually consist of the hardware and software used for testing. However, this should also attempt to define potential side effects on the server side which might impact performance, such as processes running periodically or the state of the buffer cache:

# CREATE TABLE foo (id SERIAL NOT NULL PRIMARY KEY, value INTEGER NOT NULL);
# INSERT INTO foo (value) SELECT i FROM generate_series(1, 1000000) i;
# \timing
# SELECT * FROM foo WHERE value = 1;
Time: 124.278 ms
# \! sudo sync
# \! echo 3 | sudo tee -a /proc/sys/vm/drop_caches
# SELECT * FROM foo WHERE value = 1;
Time: 265.871 ms

In order to mitigate the impact of the environment on performance, the test execution part of the plan should make the necessary provisions for running each test case multiple times. In the final result analysis part, it then becomes possible to produce mean or median values of the measurements.
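
As a hypothetical illustration (the measurement table and its columns below are not part of the original plan), the analysis itself could be expressed in SQL once the measured durations are loaded back into the database:

-- Hypothetical table holding one row per execution of each test case
CREATE TABLE measurement (test_case TEXT NOT NULL, duration_ms REAL NOT NULL);

-- Mean duration and number of runs per test case
SELECT test_case, avg(duration_ms) AS mean_ms, count(*) AS runs
    FROM measurement GROUP BY test_case ORDER BY mean_ms;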

Cost

The cost in PostgreSQL is the planner’s guess at how long it will take to execute a statement, measured in units of disk page fetches. To obtain the cost of a statement, prepend the EXPLAIN command to the statement:

# CREATE TABLE foo (id SERIAL NOT NULL PRIMARY KEY, value INTEGER NOT NULL);
# INSERT INTO foo (value) VALUES (0);
# EXPLAIN SELECT * FROM foo WHERE value = 0;
Seq Scan on foo  (cost=0.00..36.75 rows=11 width=8)

If the test plan consists of repeatedly inserting rows, this may impact the cost of later statements. Furthermore, if the test plan deletes rows after they are inserted, this may also impact cost:

# INSERT INTO foo (value) SELECT i FROM generate_series(1, 1000000) i;
# DELETE FROM foo;
# INSERT INTO foo (value) VALUES (0);
# EXPLAIN SELECT * FROM foo WHERE value = 0;
Seq Scan on foo  (cost=0.00..4425.01 rows=1 width=8)

Even though the cost is more than a hundred times higher, the execution time is actually comparable. So, the concept of cost is not necessarily an accurate measure of performance; it is mostly useful when compared against itself on the same version of PostgreSQL.

Timing

The timing is the actual wallclock time measured in milliseconds to execute a statement. To enable timing of all statements, simply enter the \timing switch and make sure the current status is on:

# \timing
Timing is on.

As opposed to cost, the load of the database and the overall system may impact the timing of statements:

# CREATE TABLE foo (id SERIAL NOT NULL PRIMARY KEY, value INTEGER NOT NULL);
# INSERT INTO foo (value) SELECT i FROM generate_series(1, 1000000) i;
# SELECT * FROM foo WHERE value = 1;
Time: 126.964 ms
# \! sudo updatedb &
# SELECT * FROM foo WHERE value = 1;
Time: 195.682 ms

Even though timing provides a more accurate measure of performance than cost, the result must be obtained from the interactive terminal. The side effect on the client side is that test cases become harder to define because they must also parse this output.

Logs

Instead of using the interactive terminal, the logs can also be used for timing statements. To enable logging of timing information for all statements, set the following configuration variable in postgresql.conf and make sure to reload the server configuration after applying the change:

log_min_duration_statement = 0

The separation of timing information from test execution essentially removes the side effect on the client side which no longer needs to parse additional information. This makes it simpler to define test cases which can now be executed from anywhere. For example, the following test cases can be executed a hundred times directly from the command line to measure the performance of inserts:

$ echo "CREATE TABLE foo (id SERIAL NOT NULL PRIMARY KEY, value INTEGER NOT NULL);" | psql
$ for i in `seq 100`; do cat <<'EOF' | psql; done
INSERT INTO foo (value) VALUES (0);
INSERT INTO foo (value) VALUES (0), (1);
INSERT INTO foo (value) VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9);
INSERT INTO foo (value) VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9)... (99);
EOF

The logs can then be parsed and analysed to produce reports as defined in the test plan. For example, the following report provides the median values for running each test case:

2.102 ms: INSERT INTO foo (value) VALUES (?);
1.295 ms: INSERT INTO foo (value) VALUES (?), (?);
1.314 ms: INSERT INTO foo (value) VALUES (?), (?), (?), (?), (?), (?), (?), (?), (?), (?);
2.982 ms: INSERT INTO foo (value) VALUES (?), (?), (?), (?), (?), (?), (?), (?), (?), (?)... (?);

The problem with this report is that the first statement seems to have a timing inconsistent with the other statements. After executing the test cases in the reverse order, the same problem occurs for the first statement which has a timing approximately 0.8 ms higher. Even though logs mitigate some side effects on the client side, the order when executing test cases might also have side effects.

Summary

This post demonstrated how side effects can significantly impact accuracy when testing performance. To mitigate these side effects on the server side, executing test cases multiple times makes it possible to produce mean or median values of the performance. On the client side, using logs instead of the interactive terminal makes it simpler to define test cases. However, there are so many possible side effects that critical attention to inconsistencies must be given throughout the test plan.

INSERT IGNORE in SQL

June 21, 2010

When using INSERT on a table containing UNIQUE or PRIMARY KEY constraints, a duplicate row causes an error and the statement is aborted. Sometimes, it is desirable to IGNORE the duplicate error and return the row when it exists already.

The objective of this post is to explore techniques to INSERT IGNORE rows while adhering to the SQL-92 standard as opposed to relying on custom features only supported by some databases.

SQL

Before considering client or server side techniques, let’s concentrate on simply using SQL statements. To provide concrete examples, PostgreSQL will be used and this statement will create a table containing both UNIQUE and PRIMARY KEY constraints:

CREATE TABLE foo (id SERIAL NOT NULL PRIMARY KEY, name TEXT NOT NULL UNIQUE);

The first technique for implementing INSERT IGNORE consists of using a SELECT statement that conditionally supplies a single row to be inserted:

INSERT INTO foo (name) SELECT 'a' WHERE NOT EXISTS
(SELECT 1 FROM foo WHERE name = 'a') RETURNING foo.id;

The second technique builds upon the previous one by reusing a SELECT statement that conditionally combines multiple rows to be inserted at once. Note that the GROUP BY clause is only necessary if duplicate values might be provided in the statement itself. Without this clause, such duplicates would all pass the anti-join when the row doesn’t exist already and then cause a duplicate error on insert:

INSERT INTO foo (name) SELECT v.* FROM (VALUES ('a'), ('b')) AS v(name)
LEFT JOIN foo f ON (f.name = v.name) WHERE f.name IS NULL GROUP BY v.name RETURNING foo.id;

In both techniques, the statement returns a result only when one or more rows are inserted, otherwise nothing. The problem is that another statement would therefore be necessary to select the rows which existed already.
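
For example, a second statement along these lines could fetch the id of every row by name, whether it was just inserted or existed already (the literal values simply repeat those from the insert):

SELECT id, name FROM foo WHERE name IN ('a', 'b');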

Client Side

On the client side, the requirements often go beyond strictly using SQL statements. For example, it is not unreasonable to expect a function to return the row which was either inserted or already existed. To provide concrete examples, Storm will be used and this definition will map the above SQL table to a native Python class:

from storm.locals import *
class Foo(object):
    __storm_table__ = "foo"
    id = Int(primary=True, allow_none=False, default=AutoReload)
    name = Unicode(allow_none=False)
    def __init__(self, name):
        self.name = name

The most obvious technique for implementing INSERT IGNORE consists of attempting to find an existing object and, if nothing is returned, inserting a new object. However, this technique contains an unfortunate race condition: if two concurrent sessions attempt to find the same object at the same time, both will attempt to add the same object and one will inevitably cause a duplicate error:

def insert_ignore_foo(store, name):
    row = store.find(Foo, name=name).one()
    if not row:
        row = store.add(Foo(name))
        store.flush()
    return row

A more reliable technique consists of attempting to add the object first and, if that causes a duplicate error, then finding the object which inevitably exists already. The problem with this technique is that performance might be affected if duplicate objects are expected to be encountered often because the add call is attempted first:

def insert_ignore_foo(store, name):
    try:
        row = store.add(Foo(name))
        store.flush()
    except IntegrityError, e:
        store.rollback()
        row = store.find(Foo, name=name).one()
    return row

By combining the above functions, it is possible to benefit from both reliability and optimal performance when duplicate objects are encountered often:

def insert_ignore_foo(store, name):
    row = store.find(Foo, name=name).one()
    if not row:
        try:
            row = store.add(Foo(name))
            store.flush()
        except IntegrityError, e:
            store.rollback()
            row = store.find(Foo, name=name).one()
    return row

In the best case, these functions result in a single roundtrip to the database. In the average case, the same can be accomplished if the appropriate function is chosen based on the access patterns. The only limitation is that inserting multiple rows cannot be accomplished in a single roundtrip.

Server Side

On the server side, it is possible to improve performance over the client side by reducing the overhead of the roundtrips to and from the database. Note that each roundtrip requires serializing Python objects to the database and then unserializing the results returned from the database.

The first technique consists of creating a trigger before insert on the table which will skip the operation when the row exists already. The problem with this approach is that skipping the operation prevents it from using the RETURNING clause which basically changes the behavior of the INSERT statement when the row exists already:

CREATE OR REPLACE FUNCTION insert_ignore_foo() RETURNS trigger AS $$
DECLARE r foo;
BEGIN
    SELECT INTO r * FROM foo WHERE name = NEW.name;
    -- Returning NULL skips the INSERT when the row exists already.
    IF found THEN RETURN NULL; END IF;
    -- Returning NEW lets the INSERT proceed otherwise.
    RETURN NEW;
END
$$ LANGUAGE plpgsql;
CREATE TRIGGER foo_insert_trigger BEFORE INSERT ON foo
    FOR EACH ROW EXECUTE PROCEDURE insert_ignore_foo();

Another technique also consists of creating a similar trigger which always attempts to delete the row before inserting it. This solves the problem of changing the behavior of the INSERT statement but introduces another problem relating to existing entries which might suddenly be deleted. This might not be considered acceptable in some situations where foreign key references point to rows in the table but might actually be acceptable in other situations:

CREATE OR REPLACE FUNCTION insert_ignore_foo() RETURNS trigger AS $$
BEGIN
    -- Remove any existing row with the same name, then let the INSERT proceed.
    DELETE FROM foo WHERE name = NEW.name;
    RETURN NEW;
END
$$ LANGUAGE plpgsql;
CREATE TRIGGER foo_insert_trigger BEFORE INSERT ON foo
    FOR EACH ROW EXECUTE PROCEDURE insert_ignore_foo();

The most recommended technique on the server side is probably to create a function instead of a trigger. This function essentially implements the same lessons learned on the client side:

CREATE OR REPLACE FUNCTION insert_ignore_foo(n TEXT) RETURNS foo AS $$
DECLARE r foo;
BEGIN
    LOOP
        SELECT INTO r * FROM foo WHERE name = n;
        IF found THEN RETURN r; END IF;
        BEGIN
            INSERT INTO foo (name) VALUES (n) RETURNING * into r;
            RETURN r;
        EXCEPTION WHEN unique_violation THEN NULL; END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
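
For example, the function could then be called directly from a statement; each call returns the row that was either inserted or found to exist already:

SELECT * FROM insert_ignore_foo('a');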

The server side actually happens to be quite similar in performance to the client side in terms of resulting statements. Even the limitations are the same in terms of inserting multiple rows. So, the only potential benefit is reducing the serialization and unserialization overhead in the worst case where the client side degrades to more than one roundtrip on average.

Summary

SQL statements provide a clear advantage when inserting multiple rows which can potentially result in a single statement. However, there is no clear advantage between client and server side functions when inserting a single row.

Streaming Test Results

June 29, 2009

The previous post featured the Subunit test framework which supports streaming of test results. The output is expressed in a structured format which can be filtered into other formats such as pyunit.

The objective of this post is to explore the requirements on formats to support streaming of test results. The motivation is that the streams produced by diverse test suites could then be filtered into a consistent output which is essential in the context of integration.

Structured Data

Basically, test results are nothing more than structured data. In its simplest form, this consists of representing information in such a way that it can be written and then read in a consistent way. In the widest interpretation of the definition, this can be encountered when defining data types in the source code of a programming language and having them parsed by the corresponding compiler. In a narrower interpretation, test results should be expressed in such a way that they can easily be understood by a variety of programming languages.

The point is that data should be written in a structured format to be read by other programs rather than human beings. This might seem obvious, but it is worth mentioning explicitly because there are several test suites which don’t apply this principle by default.

Data Formats

The format of data consists of its intermediate representation between the time it is written and later read. XML is a popular example often used in the context of Web Services, where it represents the message format when exchanging data between a requester and a provider. This context is particularly relevant with respect to integration because the requester and the provider can be written in different programming languages. In theory, integration is seemingly simple because the format has been standardized, so a parser is readily available in most contemporary programming languages. In practice, integration is rarely simple.

In addition to standardized formats, it is also worth mentioning some of the benefits of specialized formats. For example, the Subunit test framework defines its own output format for test results which happens to be trivial to read and write in just about any programming language. Even in a shell script, the output can be parsed using the most basic regular expressions.

So, even though there are benefits to adopting standardized formats, there are also good reasons to use specialized formats. The relevant part is that they should be trivial to parse.

Meta Information

In addition to the raw data, a format can also specify meta information which is essentially “data about the data”. Here are the most common approaches to specifying such information:

  • Some data formats do not contain any meta information and simply express all raw data as strings. For example, this is the case for the current version of the Stomp protocol. It is therefore the responsibility of the application layer to explicitly coerce everything into the corresponding types and data structures.
  • Other formats only encode type information within the data itself. For example, it is possible to consistently parse the YAML format while preserving type information regardless of the programming language. However, it is still the responsibility of the application layer to validate the structure of the data.
  • Finally, another approach consists of decoupling the raw data from the meta information. For example, Web Services typically expose an XML schema which can define extensive information about the data such as type and structure. So, when the application layer finally receives the data, it can reasonably assume that the data has been validated for most intents and purposes.

A priori, it would seem that encoding type information within the data might be desirable. However, if the structure of the data still needs to be validated by the application layer, this actually provides little gain. In other words, there is little effort in validating the type of the data if the structure has to be validated anyway.

Interoperability

The IEEE defines interoperability as “the ability of two or more systems or components to exchange information and to use the information that has been exchanged.” In the context where the components consist of test suites and the information exchanged consists of test results, the implication is that the output should adhere to a common information exchange reference model. In other words: what is sent should be the same as what is understood.

The most obvious approach would require that all test suites output the same data format as defined by the integration framework. This does not necessarily imply that this format is specific to a particular programming language. However, it does imply that every component must adhere to the same format.

Another approach would lower the requirements further to the semantics of the information rather than the format. This could be accomplished by supporting multiple formats which could be converted between each other without losing any meaning. This implies that if type information is part of the semantics, a format such as YAML could not be converted to Stomp which expresses everything as strings. However, the other way around would be possible assuming that the YAML filter understood how to cast the information contained in the Stomp data.

The problem with supporting multiple formats is that this could result in an explosion of filters. Converting data directly between every pair of formats would require implementing a filter for each permutation, which is a many to many relationship. However, converting to and from each other format through a single common format becomes a one to many relationship.

Streaming Data

In the most general sense, streaming refers to making a sequence of data elements available over time. In this particular context, the data elements consist of test results and they should be made available in a pipeline on the command line.

  • Some data formats already support streaming messages on a pipeline. For example, the Stomp protocol was designed in the context of messaging systems where multiple messages are frequently exchanged between the sender and the receiver. The protocol clearly defines a separator to indicate the end of each message which can be demultiplexed by the receiver.
  • Other data formats were designed to serialize individual objects rather than multiple objects in a stream. For example, JSON expects to interpret a complete message all at once by eval’ing the content. So, there is essentially no concept of a separator to identify the beginning or end of individual messages within a stream.

The solution for these other data formats is to encapsulate individual messages into a container format which actually does support streaming. For example, a JSON message could be encapsulated into the body of a Stomp message which could then be streamed.

Summary

This post has only identified the single requirement that test results should be structured to be understood by other programs. The rest mostly consisted of considerations for which there are known solutions with varying impact. However, none of these considerations seem to be absolute blockers.

Test Result Codes

June 23, 2009

A test result consists of the output from executing a test method which makes a specific assertion. This post concentrates solely on qualitative results which can be identified with specific codes, such as pass or fail for example.

The scope of this post consists of exploring the various test result codes used by existing test environments. These consist not only of test frameworks but also protocols and standards.

Ultimately, the objective is to define integration mechanisms to limit test result codes to a minimal subset encompassing all test environments.

Overview

The following table provides an overview of the relation between test environments and their corresponding result codes:

  Checkbox DejaGnu Greg JUnit POSIX Subunit TAP TETware
ERROR       x   x    
FAIL x x x x x x x x
FIP               x
INCOMPLETE         x      
KFAIL   x            
KPASS   x            
NOTINUSE               x
PASS x x x x x x x x
SKIP x         x x  
TODO             x  
UNINITIATED               x
UNREPORTED               x
UNRESOLVED   x x   x     x
UNSUPPORTED   x x   x     x
UNTESTED   x x   x     x
UPASS     x          
WARNING               x
XFAIL   x x     x    
XPASS   x            

Test Environments

These are the test environments from the above table with some additional details:

  • Checkbox: Framework for integrating automated and manual test suites.
  • DejaGnu: Framework for testing other programs. Its purpose is to provide a single front end for all tests. Think of it as a custom library of Tcl procedures crafted to support writing a test harness. A test harness is the testing infrastructure that is created to support a specific program or tool. Each program can have multiple testsuites, all supported by a single test harness.
  • Greg: Framework for testing other programs and libraries. Its purpose is to provide a single front end for all tests and to be a small, simple framework for writing tests. Greg leverages off the Guile language to provide all the power (and more) of other test frameworks with greater simplicity and ease of use.
  • POSIX 1003.3: Information technology — Requirements and Guidelines for Test Methods Specifications and Test Method Implementations for Measuring Conformance to POSIX Standards.
  • JUnit: Simple framework for writing and running automated tests. As a political gesture, it celebrates programmers testing their own software.
  • Subunit: Streaming protocol for test results. The protocol is human readable and easily generated and parsed. By design all the components of the protocol conceptually fit into the xUnit TestCase->TestResult interaction.
  • TAP: The Test Anything Protocol (TAP) is a protocol to allow communication between unit tests and a test harness. It allows individual tests (TAP producers) to communicate test results to the testing harness in a language-agnostic way.
  • TETware: The TETware family of tools are Test Execution Management Systems that take care of the administration, sequencing, reporting and portability of all of the tests that you develop. This happens to be the framework used by the Linux Standard Base test suite.

Result Codes

These are the test result codes supported by the above test environments. Some of the descriptions have been taken directly from their test environment whereas others have been simplified for the more general context of this post:

  • ERROR: A test executed in an unexpected fashion; this outcome requires that a human being go over results, to determine if the test should have passed or failed.
  • FAIL: A test has produced the bug it was intended to capture. That is, it has demonstrated that the assertion is false.
  • FIP: Further information must be provided manually; this occurs when the test is unable to determine whether the test should pass or fail.
  • INCOMPLETE: The test of the assertion was unable to prove PASS but encountered no FAILs.
  • KFAIL: A bug in the implementation under test is causing a known false assertion.
  • KPASS: A test was expected to fail with KFAIL but passed instead.
  • NOTINUSE: A test might not be required in certain modes or, when there are multiple versions of the test, only one can be used.
  • PASS: A test has succeeded. That is, it demonstrated that the assertion is true.
  • SKIP: There is a test for this assertion but it was purposefully skipped.
  • TODO: A test represents a feature to be implemented or a bug to be fixed. The assertion is expected to be false.
  • UNINITIATED: The particular test in question did not start to execute.
  • UNREPORTED: A major error occurred during the test execution.
  • UNRESOLVED: A test produced indeterminate results. This essentially means the same as an ERROR.
  • UNSUPPORTED: An optional feature is not available or not supported in the implementation under test.
  • UNTESTED: There is no test for this assertion. This is a placeholder, used when there is no real test case yet.
  • UPASS: A test was expected to fail but passed instead.
  • WARNING: A true assertion is currently expected, but later revisions may change the requirements in this area.
  • XFAIL: A bug in the environment running the test is causing a known false assertion. This is outside the control of the test.
  • XPASS: A test was expected to fail with XFAIL but passed instead. Whatever bug that used to exist in the environment was corrected.

Types of Codes

The POSIX 1003.3 standard distinguishes two types of test result codes for test method implementations:

  1. An intermediate test result code is one that requires further processing to determine the final result code. Test method implementations may use additional intermediate codes to provide the user with as much information as possible. The intermediate result codes, as interpreted from the above definitions, would consist of: UNREPORTED, UNRESOLVED, INCOMPLETE and UNINITIATED.
  2. The final test result codes require no further processing to determine the result of testing an assertion. As opposed to intermediate test result codes, test method implementations should not give any other meaning to the final test result codes. These would basically consist of all the other codes.

In other words, these types of test result codes can be expressed otherwise as incomplete and complete. In the context of integrating test environments, this provides a mechanism to coerce any number of codes into two types which serve a clear purpose.

Groups of Codes

The TETware framework distinguishes two groups of test result codes for reporting purposes. Only considering the codes from this test framework, the groups consist of:

  1. Passing codes: PASS, WARNING, FIP, UNSUPPORTED, NOTINUSE, UNTESTED
  2. Failing codes: FAIL, UNINITIATED, UNREPORTED, UNRESOLVED

These groups are particularly relevant in the context of the TETware framework in order to simplify the management of such a large set of test result codes. However, this can prove to be even more relevant in the context of integrating test environments which must inevitably support an even larger set of codes.

When considering this mechanism in combination with the types of codes mentioned above, it should be noted that the TETware framework assumes that incomplete codes are failures. An alternative approach could be to only group complete codes as either passing or failing.

Coercing Codes

The DejaGnu framework prides itself on conforming to the POSIX standard explored in this post. However, it provides support for a few test result codes which are not mentioned in the standard. This is accomplished by coercing the following codes into the corresponding POSIX codes:

  • FAIL: KFAIL, XFAIL
  • PASS: KPASS, XPASS

This type of coercion compresses multiple test result codes into a single code. This should be considered a lossy compression because the outcome loses granularity and inevitably becomes more ambiguous.

Another type of coercion translates the test result code from one test environment to an equivalent code from another environment. However, as with compression, some granularity can potentially be lost in the process. For example, if the SKIP and UNTESTED codes were translated into the same code, it would no longer be possible to determine whether the underlying test method existed or not.

Summary

This post has identified three integration mechanisms which can be used to limit a set of test result codes. The common thread is that this limiting factor depends on the level of granularity required during integration. This is a sliding rule which essentially defines what can potentially constitute the minimal subset of codes.
