OK, let me see if I can explain.
I have some code that wraps a Java iterator (from Hadoop, as it happens) in a Scala Stream, so that it potentially can be read more than once, by client code that I have no direct control over. The last thing that gets done with this Stream is a reduce() operation. Stream remembers all the items that it’s already seen. Unfortunately, in some circumstances the iterator will be extremely large, so that storing all the items in it will lead to out-of-memory errors. However, in general, the situations where the client code needs the multiple-iteration facility are not the same ones with the memory-busting Iterators, and if such cases do exist, that’s not my problem.
What I want to ensure is that I can provide the memoizing capability for code that needs it, but not for code that doesn’t need it (in particular, for code that never looks at the Stream at all).
The code for reduce() in Stream says that it’s written in a way to allow for GC of the already-visited parts of the Stream to happen while reducing. So if I can make sure this actually happens, I’ll be fine. But in practice how can I make sure that this happens? In particular, if function A creates and passes the stream to function B, and function B passes the stream to function C, and function C then calls reduce(), then what about the references to the stream still in functions A, B and C? In all these cases, there will be no further use of the stream in any of the three functions, although the calls aren’t necessarily tail-recursive. Is the JVM smart enough to ensure that its reference count is 0 from functions A, B and C at the time that reduce() is called, so that the GC can happen? Essentially this means that the JVM notices in function A that the last thing it does with the item is call function B, so it eliminates its own handle at the same time it calls B, and likewise for B to C, and C to reduce().
If this works properly, does it also work if A, B or C has a local variable holding onto the item (which, again, won’t be used afterwards)? I ask because it’s rather trickier to write this code without using local vars.
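A variable which is in scope but which will never be read from is dead. A JVM is free to ignore dead variables for the purposes of garbage collection; an object which is only pointed to by dead variables is unreachable, and may be collected. The relevant bit of the JLS is, obscurely enough, §12.6.1 Implementing Finalization, which says:
"A reachable object is any object that can be accessed in any potential continuing computation from any live thread."
And explains that:
"Optimizing transformations of a program can be designed that reduce the number of objects that are reachable to be less than those which would naively be considered reachable. For example, a Java compiler or code generator may choose to set a variable or parameter that will no longer be used to null to cause the storage for such an object to be potentially reclaimable sooner."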
If your method A has only dead variables referring to the stream, then it won’t obstruct its collection.
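If you want to see what your own JVM does with a dead local, here is a rough probe. The class name DeadLocalProbe and the method probe are made up for illustration. The result is not guaranteed either way: the JLS only permits the optimization, System.gc() is only a hint, and the outcome may differ depending on whether the method has been JIT-compiled.

```java
import java.lang.ref.WeakReference;

// Hypothetical probe, not from the original post: checks whether an object
// that is referenced only by a dead local variable gets collected.
class DeadLocalProbe {
    static String probe() {
        Object big = new byte[1 << 20];                    // strong reference in a local
        WeakReference<Object> ref = new WeakReference<>(big);
        // 'big' is still in scope below, but it is never read again, so it is dead.
        System.gc();                                       // a request only; the JVM may ignore it
        return ref.get() == null ? "collected" : "still reachable";
    }

    public static void main(String[] args) {
        // Either answer is legal; do not build correctness on one of them.
        System.out.println(probe());
    }
}
```

Treat whatever it prints as an observation about one run on one JVM, not as a promise about the behavior of your real code.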
Note, however, that that means local variables: if you have fields which refer to the stream (including closed-over local variables from a method enclosing a nested class), then this doesn’t apply; I don’t think the JVM is allowed to treat these as dead. In other words, consider a method that holds an object in a local variable o, calls toString on it, and also captures o in an anonymous Callable.
The object o cannot be collected until the anonymous Callable is collected, even though it is never used after the toString call, because there is a synthetic field referring to it in the Callable.
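The snippet that originally accompanied this point isn't preserved here, so the following is only a sketch of the kind of code meant. The names CapturedReferenceDemo, capture and syntheticFieldRefersTo are hypothetical. It uses reflection to show that an anonymous class compiled by javac keeps the captured variable in a synthetic field, which is an ordinary strong reference as far as the GC is concerned.

```java
import java.lang.reflect.Field;
import java.util.concurrent.Callable;

// Hypothetical reconstruction, not the original snippet.
class CapturedReferenceDemo {
    // The anonymous class closes over 'o'; the compiler stores it in a synthetic field.
    static Callable<String> capture(final Object o) {
        return new Callable<String>() {
            @Override
            public String call() {
                return o.toString();
            }
        };
    }

    // Returns true if some synthetic field of 'instance' holds exactly 'target'.
    static boolean syntheticFieldRefersTo(Object instance, Object target) {
        try {
            for (Field f : instance.getClass().getDeclaredFields()) {
                if (f.isSynthetic() && !f.getType().isPrimitive()) {
                    f.setAccessible(true);
                    if (f.get(instance) == target) {
                        return true;
                    }
                }
            }
            return false;
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Object o = new Object();
        o.toString();                       // last direct use of o in this method
        Callable<String> c = capture(o);    // but c now holds o in a field
        System.out.println(syntheticFieldRefersTo(c, o));
    }
}
```

As long as c is reachable, so is o, whatever the liveness of the local variable that first held it.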