The following very simple query from the XQTS 3.1 will result in a java.lang.OutOfMemoryError.
count(subsequence(1 to 3000000000, -2147483649))
It should also be considered as a security issue. This simple query could be sent to the REST end-point of any eXist-db server (which exposes the service), and it will cause the JVM to shutdown.
The XQTS states that we can either return the value 3000000000, or any error code. So we can decide what we want to do here.
let's go with 3000000000
@duncdrum that would be the ideal solution, however that value has to be computed, and doing that efficiently (i.e. not running our of memory) would need quite some changes to eXist-db as far as I can see.
@adamretter major work for a major release? I checked saxon and it produces and error, about the maximum sequence length no exceeding 2147483647. I vaguely recall new BigInteger methods coming to java 8 could we maybe leverage that for throwing ArithmenticException?
@duncdrum so the problem is as always, who will do the Major work?
I don't think BigInteger can help us here. We could easily go the Saxon route and return an error. If we don't want to do that, then the OutOfMemoryError problem at the moment is caused by the interactions between RangeSequence and FunSubSequence.
(I am CC'ing @wolfgangmm @dizzzz @joewiz @ljo as I think the explanation below will be of interest to them).
From RangeSequence line ~63:
private static class RangeSequenceIterator implements SequenceIterator {
...
public Item nextItem() {
if (current <= end) {
return new IntegerValue(current++);
} else {
return null;
}
}
...
}
Basically RangeSequence#RangeSequenceIterator backs the XQuery range expression, e.g. 1 to 3000000000. For each call of RangeSequenceIterator#nextItem(), it will create a new IntegerValue Java object in memory that holds a single value of the range.
From FunSubSequence.java line ~149:
public class FunSubSequence extends Function {
...
public Sequence eval(Sequence contextSequence, Item contextItem) throws XPathException {
...
Item item;
final SequenceIterator iterator = seq.iterate();
for(int i = 0; i < start; i++) {
item = iterator.nextItem();
}
int i=0;
while (iterator.hasNext() && i < length) {
item = iterator.nextItem();
tmp.add(item);
i++;
}
...
}
}
Above, when the fn:subsequence is eval(uated), on the seq (which is a RangeSequence in this instance):
the first loop (the for loop), moves to the $startingLoc position (which is the 1st argument to fn:subsequence) in the underlying range sequence. Unfortunately it does this by calling RangeSequenceIterator#nextItem(), which creates a new IntegerValue object for each call. This is very inefficient as we don't need those IntegerValue objects, because we are just re-positioning to the location we need to start from. We are wasting memory and CPU here unnecessarily.
the second loop (the while loop), iterates from $startingLoc to the position indicated by $length (which is the 2nd argument to fn:subsequence). If $length is not provided, we just iterate until the end of the RangeSequence. Again each iteration causes RangeSequenceIterator#nextItem() to be called, which creates a new IntegerValue.
So the query fn:subsequence(1 to 3000000000, 1) will cause 3000000000 IntegerValue objects to be allocated.
Unfortunately because of the design above, the similar query: fn:subsequence(1 to 3000000000, 2999999999) will also cause 3000000000 IntegerValue objects to be allocated, even though we ignore the first 2999999999 items :-(
An IntegerValue in eXist-db contains both a int (which indicates the XDM type) and a BigInteger (the actual value), so its memory use is the requirements of those two plus the object framing overhead. We can calculate the rough size of an IntegerValue as follows:
int itself is 4 bytes.BigInteger is 8 bytes (on a 64 bit machine).BigInteger holding a 32 bit number will consume approx 68 bytes. A BigInteger holding a 64 bit number will consume approx 72 bytes.So the size of an IntegerValue (for this experiment) is approximately between 80 bytes and 84 bytes.
So the memory use for the range of 1 to 3000000000, will be between 223.52 GB and 234.69 GB... Hence the OutOfMemoryError!
We have a few options for fixing this:
fn:count - fn:count should not need to actually retrieve items from the sequence to get the count of them, unless the number of items in the sequence is not stable/deterministic.
Most helpful comment
@duncdrum so the problem is as always, who will do the Major work?
I don't think
BigIntegercan help us here. We could easily go the Saxon route and return an error. If we don't want to do that, then theOutOfMemoryErrorproblem at the moment is caused by the interactions betweenRangeSequenceandFunSubSequence.(I am CC'ing @wolfgangmm @dizzzz @joewiz @ljo as I think the explanation below will be of interest to them).
From
RangeSequenceline ~63:Basically
RangeSequence#RangeSequenceIteratorbacks the XQuery range expression, e.g.1 to 3000000000. For each call ofRangeSequenceIterator#nextItem(), it will create a newIntegerValueJava object in memory that holds a single value of the range.From
FunSubSequence.javaline ~149:Above, when the
fn:subsequenceiseval(uated), on theseq(which is aRangeSequencein this instance):the first loop (the
forloop), moves to the$startingLocposition (which is the 1st argument tofn:subsequence) in the underlying range sequence. Unfortunately it does this by callingRangeSequenceIterator#nextItem(), which creates a newIntegerValueobject for each call. This is very inefficient as we don't need thoseIntegerValueobjects, because we are just re-positioning to the location we need to start from. We are wasting memory and CPU here unnecessarily.the second loop (the
whileloop), iterates from$startingLocto the position indicated by$length(which is the 2nd argument tofn:subsequence). If$lengthis not provided, we just iterate until the end of theRangeSequence. Again each iteration causesRangeSequenceIterator#nextItem()to be called, which creates a newIntegerValue.So the query
fn:subsequence(1 to 3000000000, 1)will cause 3000000000IntegerValueobjects to be allocated.Unfortunately because of the design above, the similar query:
fn:subsequence(1 to 3000000000, 2999999999)will also cause 3000000000IntegerValueobjects to be allocated, even though we ignore the first2999999999items :-(An
IntegerValuein eXist-db contains both aint(which indicates the XDM type) and aBigInteger(the actual value), so its memory use is the requirements of those two plus the object framing overhead. We can calculate the rough size of anIntegerValueas follows:intitself is 4 bytes.BigIntegeris 8 bytes (on a 64 bit machine).BigIntegerholding a 32 bit number will consume approx 68 bytes. ABigIntegerholding a 64 bit number will consume approx 72 bytes.So the size of an
IntegerValue(for this experiment) is approximately between 80 bytes and 84 bytes.So the memory use for the range of
1 to 3000000000, will be between 223.52 GB and 234.69 GB... Hence theOutOfMemoryError!We have a few options for fixing this:
fn:count- fn:count should not need to actually retrieve items from the sequence to get the count of them, unless the number of items in the sequence is not stable/deterministic.