Interesting Bug
Friday 16 March 2012
I've spent the whole of today chasing down a bug. I've finally found what's causing it, yet currently have no idea how to solve it. I think it's a rather amazing bug which shows some very interesting behaviour.
Over the last couple of days, I've been building out a test suite so
that as I add additional features, I can have a degree of confidence
I've not obviously broken things. This morning I wrote a test which
deliberately has a large number of transactions that collide with each
other: indeed overall progress is slow because of the huge contention
created. The basic idea is we start with a global a
object, and
structure (this is all a bit simplified, but not by too much):
a = {0: {num: 1000},
1: {num: 1000},
2: {num: 1000}}
Then, every transaction decrements the num
field in every object it
finds within a
. Just to shake things up a bit more, randomly, each
transaction can replace one of the inner objects. The replacement will
also contain a num
field and the correct value. The test stops when
all the fields reach 0
.
I set up several clients connected to the same AtomizeJS server, and each client ran several transactions that looked like:
var fun;
fun = function (c) {
c.atomically(function () {
if (undefined === a) {
c.retry();
}
var keys = Object.keys(a),
x, field, n, obj;
for (x = 0; x < keys.length; x += 1) {
field = keys[x];
if (undefined === n) {
n = a[field].num;
if (0 === n) {
return n;
}
} else if (n !== a[field].num) {
throw ("All fields should have the same number: " +
n + " vs " + a[field].num);
}
if (0.5 < Math.random()) {
obj = c.lift({});
obj.num = n;
a[field] = obj;
}
a[field].num -= 1;
}
return n;
}, function (n) {
if (n > 0) {
fun(c); // recurse
} else {
// Test done!
}
});
}
And then, with various different AtomizeJS clients, invoke fun
with
the client, and set up a suitable a
object managed by AtomizeJS and
known to all the clients.
Turns out, we hit the exception within the transaction. Yup,
within the transaction, we can violate the isolation and atomic
properties. Even more interesting was the minimum requirements for
provoking the bug: you need two clients (i.e. two instances of
Atomize
) and three transactions in flight at the same time (i.e. one
client must run multiple copies of the transaction at the same time
- or at least as close as you can get in JavaScript: when one
transaction commits, and goes to the network to send the transaction
log to the server, whilst waiting for the response, it then goes and
starts the other transaction). Running all three transactions in the
same client can't provoke it, nor can running one transaction each in
three different clients. If you rewrite the test so that it does the
throw based on a test in the continuation (i.e. after each transaction
has committed) then it never goes wrong, which means that the
violation is eliminated when the transaction commits. But even so,
within a transaction, you should not be able to see the partial
effects of other transactions. The random changing of objects within
a
is crucial: if you don't replace the objects, the bug doesn't
appear.
So what on earth is going on?
Two clients: c1 and c2. Three transactions: t1, t2 and t3. For simplicity, I'm going to define these transactions precisely as:
t1 = function () {
a.0.num -= 1;
a.1 = atomize.lift({num: a.0.num});
a.2.num -= 1;
};
t2 = function () {
a.0.num -= 1;
a.1.num -= 1;
a.2.num -= 1;
};
t3 = t2;
Initially, both clients are aware of a
as shown at the top of this
post. The a
, and 0
, 1
and 2
objects are all at version 1 of
themselves, and this is known to both clients.
First, c1 runs t1 followed immediately by t2: i.e. whilst the transaction log of t1 is in flight to the server, c1 starts running t2. Thus both t1 and t2 get first run on the original objects, and so both will try to change the values of 1000 to 999.
The transaction t1 will have a read set of a
, 0
and 2
, and a
write set of a
, 0
and 2
. This transaction goes to the server,
commits successfully and comes back to c1 which updates its own
copies of the objects. The object at a.1
has changed: the new object
is at version 1 (along with the previous old object that used to be
reachable from a.1
), whilst all the other objects (a
, 0
and 2
)
are now at version 2. The num
fields have values of 999 now,
though the original object that was at a.1
has a num
value of 1000
still. Only the server and c1 know all this.
Whilst that was going on, c2 runs t3. This transaction is initially
run against the original objects (version 1 of everything, with
num
fields at 1000). The transaction log arrives at the server after
t1, and gets rejected because t1 committed successfully, and
changed the versions. The server sends back to c2 version 2 of
a
, 0
and 2
, along with version 1 of the new object reachable
from a.1
(but all with num
fields of 999). The client, c2 now
restarts t3. This time t3 has a transaction log with a read set of
a
, 0
, and 2
at version 2, plus the new 1
at version 1, and
a write set of 0
(version 2), 1
(version 1 - remember: the new
replacement object), and 2
(version 2). This goes to the server
and commits correctly. The version numbers are bumped accordingly:
both c2 and the server agree that a
, 0
and 2
are now at
version 3, the old original 1
(which is no longer reachable) is at
version 1, and the new 1
is at version 2. The num
fields are
all now at 998, except for the old a.1
object that's still at 1000,
but it's unreachable, so it doesn't matter.
Now, the transaction log from t2 arrives at the server. It was run
by c1 a while ago, indeed against the original objects (version 1
of everything - num
fields at 1000), but the server's been kept busy
and is only now getting around to dealing with it. Blame the
network. The transaction log of t2 contains reads of version 1 of
a
, 0
, 1
(the original 1
) and 2
. The server notices that
these are old versions and rejects the transaction. But here comes the
problem: the server sends down the current versions of a
, 0
and
2
(all version 3 - num
fields at 998). It does not send down the
current version of a.1
because the transaction log from t2 had
nothing to do with the current object at a.1
: it only mentions the
old original object that was at a.1
, and no one's modified that
object: it still has a num
field at 1000.
So now the client c1 applies those updates, and restarts t2. Now
t2 sees that a.0
and a.1
have num
fields at 998, but a.1.num
is actually at 999, because t2 (and indeed c1 as a whole) has not
seen the effect of t3 (run by c2) on a.1
: it's only seen the
effect of t1. Thus isolation is broken: t2 is seeing parts of the
world from before t3 committing and other parts from
afterwards.
When t2 now commits again, the server will again reject it because
this time, t2's transaction log will contain a read and a write of
the new object at a.1
, but at version 1, not version 2. So the
server will reject it, send the update to the new a.1
which the
client c1 will apply to get its a.1
up to version 2, and finally
t2 will be restarted and this time will commit successfully.
Race conditions like these are always fun to unravel. Even more fun is that I currently have no idea how to solve this: it's almost like we need some sort of dependency chain to say that "if the server is going to send down version X of object J, then it must also send down version Y of object K". In many ways, this seems to be rather like a cache invalidation problem. I wonder how other STM systems solve this, whether they don't, or whether the problem really only appears due to the distributed nature of AtomizeJS. I think it might be the latter.