In short: in Clojure, is there a way to redefine a function from the

Asked: June 12, 2026 (2026-06-12T23:23:16+00:00)

In short: in Clojure, is there a way to redefine a function from the standard sequence API (which is not defined on any interface like ISeq, IndexedSeq, etc) on a custom sequence type I wrote?

1. Huge data files

I have big files in the following format:

A long (8 bytes) containing the number n of entries
n entries, each one being composed of 3 longs (ie, 24 bytes)
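
With this layout, the byte offset of any entry follows directly from the header and entry sizes. A minimal sketch of that arithmetic (the names header-bytes, entry-bytes, entry-offset and file-size are hypothetical and not part of my actual code):

```clojure
;; Hypothetical helpers; the sizes come from the file format described above.
(def ^:const header-bytes 8)   ; one long holding the number of entries n
(def ^:const entry-bytes 24)   ; three longs per entry

(defn entry-offset
  "Byte offset at which entry k (0-based) starts."
  ^long [^long k]
  (+ header-bytes (* entry-bytes k)))

(defn file-size
  "Total size in bytes of a file holding n entries."
  ^long [^long n]
  (entry-offset n))
```

For example, (entry-offset 0) is the first byte after the header, and (file-size n) is where the last entry ends.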

2. Custom sequence

I want to have a sequence on these entries. Since I cannot usually hold all the data in memory at once, and I want fast sequential access on it, I wrote a class similar to the following:

(deftype DataSeq [id
                  ^long cnt
                  ^long i
                  cached-seq]
  clojure.lang.IndexedSeq

  (index [_]     i)
  (count [_]     (- cnt i))
  (seq   [this]  this)
  (first [_]     (first cached-seq))
  (more  [this]  (if-let [s (next this)] s '()))

  (next [_] (if (not= (inc i) cnt)
              (if (next cached-seq)
                (DataSeq. id cnt (inc i) (next cached-seq))
                (DataSeq. id cnt (inc i)
                          (with-open [f (open-data-file id)]
                             ; open a memory mapped byte array on the file
                             ; seek to the exact position to begin reading
                             ; decide on an optimal amount of data to read
                             ; eagerly read and return that amount of data
                          ))))))

The main idea is to read ahead a bunch of entries in a list and then consume from that list. Whenever the cache is completely consumed, if there are remaining entries, they are read from the file in a new cache list. Simple as that.
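
The read-ahead idea can also be sketched without a custom type. The version below is only an illustration using lazy-seq rather than deftype; read-chunk is a hypothetical function that takes a start index and a number of entries and returns them eagerly (in the real code it would read from the file):

```clojure
(defn chunked-entries
  "Lazy sequence over cnt entries, fetched chunk-size at a time.
  read-chunk is a hypothetical (fn [start n]) returning n entries eagerly."
  [read-chunk cnt chunk-size]
  (letfn [(step [i]
            (lazy-seq
              (when (< i cnt)
                (let [n (min chunk-size (- cnt i))]
                  ;; consume this chunk, then fetch the next one on demand
                  (concat (read-chunk i n) (step (+ i n)))))))]
    (step 0)))

;; Usage with an in-memory vector standing in for the data file.
(def fake-data (vec (range 10)))
(def reads (atom 0))   ; counts how many chunk reads happened

(defn fake-read-chunk [start n]
  (swap! reads inc)
  (subvec fake-data start (+ start n)))

(def all-entries (doall (chunked-entries fake-read-chunk 10 4)))
```

Unlike DataSeq, this sketch does not implement count or index; it only shows the "consume the cache, then refill" pattern.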

To create an instance of such a sequence, I use a very simple function like:

(defn ^DataSeq load-data [id]
  (next (DataSeq. id (count-entries id) -1 [])))
; count-entries is a trivial "open file and read a long" memoized

As you can see, the format of the data allowed me to implement count very simply and efficiently.

3. `drop` could be O(1)

In the same spirit, I’d like to reimplement drop. The format of these data files allows me to reimplement drop in O(1) (instead of the standard O(n)), as follows:

if dropping fewer entries than remain in the cache, just drop that many from the cache and be done;
if dropping more than cnt, just return the empty list;
otherwise, figure out the position in the data file, jump right to that position, and read data from there.
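
The three cases above can be sketched as a pure decision function. This is only an illustration: drop-plan and its return values are hypothetical, "more than cnt" is interpreted as "at least as many as the remaining (- cnt i) entries", and the 8-byte header and 24-byte entries come from the file format described earlier.

```clojure
(defn drop-plan
  "Decides how to drop n entries from a sequence positioned at index i
  out of cnt entries, with `cached` entries (current one included) still
  in memory. Returns a vector describing the action; reads nothing."
  [cnt i cached n]
  (cond
    (<= n 0)          [:keep]                         ; nothing to drop
    (>= n (- cnt i))  [:empty]                        ; drops everything left
    (< n cached)      [:drop-from-cache n]            ; stay within the cache
    :else             [:seek (+ 8 (* 24 (+ i n)))]))  ; byte offset of entry i+n
```

Every branch is constant-time arithmetic, which is why drop could be O(1) for this format.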

My difficulty is that drop is not implemented in the same way as count, first, seq, etc. The latter functions call a similarly named static method in RT which, in turn, calls my implementation above, while the former, drop, does not check if the instance of the sequence it is being called on provides a custom implementation.
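
To illustrate the difference, here is a small probe type (ProbeSeq and more-calls are hypothetical, for demonstration only). It implements just enough of ISeq and Counted to show that count reaches the custom method, while drop steps through the sequence one rest call at a time. This assumes the type does not implement any interface that drop treats specially.

```clojure
(def more-calls (atom 0))   ; counts calls to the `more` method below

(deftype ProbeSeq [i cnt]
  clojure.lang.ISeq
  (first [_] i)
  (next  [_] (when (< (inc i) cnt) (ProbeSeq. (inc i) cnt)))
  (more  [this] (swap! more-calls inc) (or (.next this) ()))
  (seq   [this] this)
  clojure.lang.Counted
  (count [_] (- cnt i)))

(def probe (ProbeSeq. 0 1000))

(def probe-count (count probe))           ; answered by the count method above
(def probe-first (first (drop 500 probe))) ; forces drop to walk the sequence
```

After this runs, @more-calls shows how many times drop had to call more to skip the entries, even though count was answered directly.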

Obviously, I could provide a function named anything but drop that does exactly what I want, but that would force other people (including my future self) to remember to use it instead of drop every single time, which sucks.

So, the question is: is it possible to override the default behaviour of `drop`?

4. A workaround (I dislike)

While writing this question, I figured out a possible workaround: make the reading even lazier. The custom sequence would just keep an index and postpone the read, which would happen only when first is called. The catch is that I’d need some mutable state: the first call to first would read some data into a cache, and all subsequent calls would return data from this cache. next would follow similar logic: if there’s a cache, just next it; otherwise, don’t bother populating it, since that will be done when first is called again.

This would avoid unnecessary disk reads. However, this is still less than optimal — it is still O(n), and it could easily be O(1).

Anyways, I don’t like this workaround, and my question is still open. Any thoughts?

Thanks.

1 Answer

Editorial Team · Answer 1 · 2026-06-12T23:23:17+00:00

For the time being, I implemented the workaround I described above. It works by deferring the read until the first call to (first), which stores the data in a local, mutable cache.

Note that this version uses unsynchronized-mutable (to avoid volatile-reads on every call to first, next and more and a volatile-write on the first call to first). In other words: DON’T SHARE AMONG THREADS. To make it thread-safe, use volatile-mutable instead (which causes a small performance penalty). It could still cause multiple reads of the same data by different threads. To avoid that, change back to unsynchronized-mutable and be sure to use (locking this ...) when reading from or writing to the field cache.

EDIT: after some (non-rigorous) tests, it seems that the overhead introduced by (locking this ...) is similar to the one introduced by unnecessary reads from disk (note that I’m reading from a fast SSD, which might have already cached part of the data). Therefore, the best thread-safe solution for now (and for my specific hardware) would be to use a volatile cache.

(deftype DataSeq [id
                  ^long cnt
                  ^long i
                  ^{:unsynchronized-mutable true} cache]
  clojure.lang.IndexedSeq

  (index [_]    i)
  (count [_]    (- cnt i))
  (seq   [this] this)
  (more  [this] (if-let [s (.next this)] s '()))
  (next  [_]    (if (not= (inc i) cnt)
                  (DataSeq. id cnt (inc i) (next cache))))
  (first [_]
    (when-not (seq cache)
      (set! cache
            (with-open [f (open-data-file id)]
              ; open a memory mapped byte array on the file
              ; seek to the exact position to begin reading
              ; decide on an optimal amount of data to read
              ; eagerly read and return that amount of data
            )))
    (first cache)))

What still bothers me is that I must use mutable state just to stop drop (ie, “get out, you useless piece of data”) from reading from the disk…
