I have 3 millions line of data each has 30 features – it is

Question

0

Asked: June 14, 20262026-06-14T06:39:59+00:00 2026-06-14T06:39:59+00:00

I have 3 millions line of data each has 30 features – it is

0

I have 3 millions line of data each has 30 features – it is hard to include all in memory for my computer and slow to process it with learning algorithm – . I want to write a little code that makes random sampling but in JAVA and with my PC configurations it does not work or takes so much times to execute. I know that writing in C or C++ gives better solution but I am also curious about the availability of python for such case. Is it reasonable to use Python in such a case that Java is not working efficiently because of slowness and memory restriction – please do not say to increase heap size or such-?

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-14T06:40:01+00:00

If performance is critical, this is the sort of solution I use.

public class SimpleTable {
    private final List<RandomAccessFile> files = new ArrayList<RandomAccessFile>();
    private final List<FloatBuffer> buffers = new ArrayList<FloatBuffer>();
    private final File baseDir;
    private final int rows;

    private SimpleTable(File baseDir, int rows) {
        this.baseDir = baseDir;
        this.rows = rows;
    }

    public static SimpleTable create(String baseName, int rows) throws IOException {
        File baseDir = new File(baseName);
        if (!baseDir.mkdirs()) throw new IOException("Failed to create " + baseName);
        PrintWriter pw = new PrintWriter(baseName + "/rows");
        pw.println(rows);
        pw.close();
        return new SimpleTable(baseDir, rows);
    }

    public static SimpleTable load(String baseName) throws IOException {
        BufferedReader br = new BufferedReader(new FileReader(baseName + "/rows"));
        int rows = Integer.parseInt(br.readLine());
        br.close();
        File baseDir = new File(baseName);
        SimpleTable table = new SimpleTable(baseDir, rows);
        File[] files = baseDir.listFiles();
        Arrays.sort(files);
        for (File file : files) {
            if (!file.getName().endsWith(".float")) continue;
            table.addColumnForFile(file);
        }
        return table;
    }

    private FloatBuffer addColumnForFile(File file) throws IOException {
        RandomAccessFile rw = new RandomAccessFile(file, "rw");
        MappedByteBuffer mbb = rw.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, rows * 8);
        mbb.order(ByteOrder.nativeOrder());
        FloatBuffer db = mbb.asFloatBuffer();
        files.add(rw);
        buffers.add(db);
        return db;
    }

    public int rows() {
        return rows;
    }

    public int columns() {
        return buffers.size();
    }

    public FloatBuffer addColumn() throws IOException {
        return addColumnForFile(new File(baseDir, String.format("%04d.float", buffers.size())));
    }

    public FloatBuffer getColumn(int n) {
        return buffers.get(n);
    }

    public void close() throws IOException {
        for (RandomAccessFile file : files) {
            file.close();
        }
        files.clear();
        buffers.clear();
    }
}

public class SimpleTableTestMain {
    public static void main(String... args) throws IOException {
        long start = System.nanoTime();
        SimpleTable st = SimpleTable.create("test", 3 * 1000 * 1000);
        for (int i = 0; i < 50; i++) {
            FloatBuffer db = st.addColumn();
            for (int j = 0; j < db.capacity(); j++)
                db.put(j, i + j);
        }
        st.close();

        long mid = System.nanoTime();

        SimpleTable st2 = SimpleTable.load("test");
        for (int i = 0; i < 50; i++) {
            FloatBuffer db = st2.getColumn(i);
            double sum = 0;
            for (int j = 0; j < db.capacity(); j++)
                sum += db.get(j);
            assert sum > 0;
        }

        long end = System.nanoTime();
        System.out.printf("Took %.3f seconds to write and %.3f seconds to read %,d rows and %,d columns%n",
                (mid - start) / 1e9, (end - mid) / 1e9, st2.rows(), st2.columns());
        st2.close();
    }
}

prints

Took 2.070 seconds to write and 2.206 seconds to read 3,000,000 rows and 50 columns

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I have 3 millions line of data each has 30 features – it is

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply