I do have a table with list of files. There is id_folder, id_parrent_folder, size


Asked: June 14, 2026 (2026-06-14T03:21:34+00:00)


I have a table with a list of files. Each row has id_folder, id_parrent_folder and size (the file size):

create table sample_data (
    id_folder bigint ,
    id_parrent_folder bigint,
    size bigint
);

I would like to know how many files are in each folder including all of its subfolders, for every folder starting with a given folder. Given the sample data posted below, I expect the following output:

id_folder     files
100623           35
100624           14

Sample data:

insert into sample_data values (100623,58091,60928);
insert into sample_data values (100623,58091,59904);
insert into sample_data values (100623,58091,54784);
insert into sample_data values (100623,58091,65024);
insert into sample_data values (100623,58091,25600);
insert into sample_data values (100623,58091,31744);
insert into sample_data values (100623,58091,27648);
insert into sample_data values (100623,58091,39424);
insert into sample_data values (100623,58091,30720);
insert into sample_data values (100623,58091,71168);
insert into sample_data values (100623,58091,68608);
insert into sample_data values (100623,58091,34304);
insert into sample_data values (100623,58091,46592);
insert into sample_data values (100623,58091,35328);
insert into sample_data values (100623,58091,29184);
insert into sample_data values (100623,58091,38912);
insert into sample_data values (100623,58091,38400);
insert into sample_data values (100623,58091,49152);
insert into sample_data values (100623,58091,14444);
insert into sample_data values (100623,58091,33792);
insert into sample_data values (100623,58091,14789);
insert into sample_data values (100624,100623,16873);
insert into sample_data values (100624,100623,32768);
insert into sample_data values (100624,100623,104920);
insert into sample_data values (100624,100623,105648);
insert into sample_data values (100624,100623,31744);
insert into sample_data values (100624,100623,16431);
insert into sample_data values (100624,100623,46592);
insert into sample_data values (100624,100623,28160);
insert into sample_data values (100624,100623,58650);
insert into sample_data values (100624,100623,162);
insert into sample_data values (100624,100623,162);
insert into sample_data values (100624,100623,162);
insert into sample_data values (100624,100623,162);
insert into sample_data values (100624,100623,162);

I’ve tried to adapt the example from the PostgreSQL docs, but it (obviously) can’t work this way. Any help is appreciated.

Edit:

I’ve tried the following query:

WITH RECURSIVE included_files(id_folder, parrent_folder, dist_last_change) AS (
SELECT 
    id_folder, 
    id_parrent_folder, 
    size
FROM 
    sample_data p 
WHERE 
    id_folder = 100623
UNION ALL
SELECT 
    p.id_folder, 
    p.id_parrent_folder, 
    p.size
FROM 
    included_files if, 
    sample_data p
WHERE 
    p.id_parrent_folder = if.id_folder
)
select * from included_files

This won’t work: sample_data holds one row per file, so each folder appears many times, and joining children to parents multiplies the rows in child folders.
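To make the multiplication concrete, here is a small sketch that runs the same recursive query against an in-memory SQLite database. SQLite is only a stand-in for PostgreSQL (my assumption, so the snippet runs anywhere), the sizes are placeholders, and the alias `f` replaces `if`:

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL here (an assumption made so the
# sketch is runnable without a server).
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE sample_data "
    "(id_folder INTEGER, id_parrent_folder INTEGER, size INTEGER)"
)
# 21 files in folder 100623 and 14 files in its child 100624, as in the
# sample data. Sizes are placeholders; only the row counts matter.
rows = [(100623, 58091, 1)] * 21 + [(100624, 100623, 1)] * 14
con.executemany("INSERT INTO sample_data VALUES (?, ?, ?)", rows)

query = """
WITH RECURSIVE included_files(id_folder, parrent_folder, size) AS (
    SELECT id_folder, id_parrent_folder, size
      FROM sample_data
     WHERE id_folder = 100623
    UNION ALL
    SELECT p.id_folder, p.id_parrent_folder, p.size
      FROM included_files f
      JOIN sample_data p ON p.id_parrent_folder = f.id_folder
)
SELECT id_folder, count(*)
  FROM included_files
 GROUP BY id_folder
 ORDER BY id_folder
"""
counts = dict(con.execute(query).fetchall())
print(counts)  # 100624 shows 21 * 14 = 294 rows instead of 14
```

Every one of the 21 rows for folder 100623 joins to all 14 rows of 100624, which is where the inflated count comes from.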


1 Answer

Editorial Team · Answer 1 · 2026-06-14T03:21:35+00:00

Very nice problem to think about, I upvoted!

As I see it, there are 2 cases to think about:

multi-level paths and
multi-child nodes.

So far I’ve come up with the following query:

WITH RECURSIVE tree AS (
    SELECT id_folder id, array[id_folder] arr
      FROM sample_data sd
     WHERE NOT EXISTS (SELECT 1 FROM sample_data s
                        WHERE s.id_parrent_folder=sd.id_folder)
    UNION ALL
    SELECT sd.id_folder,t.arr||sd.id_folder
      FROM tree t
      JOIN sample_data sd ON sd.id_folder IN (
        SELECT id_parrent_folder FROM sample_data WHERE id_folder=t.id))
,ids AS (SELECT DISTINCT id, unnest(arr) ua FROM tree)
,agg AS (SELECT id_folder id,count(*) cnt FROM sample_data GROUP BY 1)
SELECT ids.id, sum(agg.cnt)
  FROM ids JOIN agg ON ids.ua=agg.id
 GROUP BY 1
 ORDER BY 1;

I’ve added the following rows to sample_data:

INSERT INTO sample_data VALUES (100625,100623,123);
INSERT INTO sample_data VALUES (100625,100623,456);
INSERT INTO sample_data VALUES (100625,100623,789);
INSERT INTO sample_data VALUES (100626,100625,1);

This query is not optimal though, and it will slow down as the number of rows grows.
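For checking the query output by hand, a plain Python cross-check of the expected totals may help. This is only a sketch and not part of the SQL solution: it assumes the sample data plus the four added rows, and the names `file_rows`, `own`, `children` and `subtree_files` are made up for the illustration.

```python
from collections import Counter, defaultdict

# (id_folder, id_parrent_folder) per file row; sizes are left out because only
# counts are checked. Row counts mirror the sample data plus the added rows.
file_rows = (
    [(100623, 58091)] * 21
    + [(100624, 100623)] * 14
    + [(100625, 100623)] * 3
    + [(100626, 100625)] * 1
)

own = Counter(folder for folder, _ in file_rows)  # files directly in a folder
children = defaultdict(set)                       # parent -> child folders
for folder, parent in set(file_rows):
    children[parent].add(folder)

def subtree_files(folder):
    """Files in `folder` plus files in every folder below it."""
    return own[folder] + sum(subtree_files(c) for c in children[folder])

expected = {f: subtree_files(f) for f in sorted(own)}
print(expected)
```

Folder 100623 covers the multi-child case (100624 and 100625) and the chain 100623, 100625, 100626 covers the multi-level case.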

Full-scale tests

To simulate the original situation, I’ve written a small Python script that scans the filesystem and stores it in the database (hence the delay, I’m not yet good at Python scripting).

The following tables were created:

CREATE TABLE fs_file(file_id bigserial, name text, type char(1), level int4);
CREATE TABLE fs_tree(file_id int8, parent_id int8, size int8);

Scanning the whole filesystem of my MBP took 7.5 minutes and produced 870k entries in the fs_tree table, which is quite similar to the original task. After the upload, I ran the following:

CREATE INDEX i_fs_tree_1 ON fs_tree(file_id);
CREATE INDEX i_fs_tree_2 ON fs_tree(parent_id);
VACUUM ANALYZE fs_file;
VACUUM ANALYZE fs_tree;

I’ve tried running my first query on this data and had to kill it after approximately 1 hour. The improved one takes around 2 minutes (on my MBP) to do the job on the whole filesystem. Here it is:

WITH RECURSIVE descent AS (
    SELECT fs.file_id grp, fs.file_id, fs.size, 1 k, 0 AS lvl
      FROM fs_tree fs
     WHERE fs.parent_id = (SELECT file_id FROM fs_file WHERE name = '/')
    UNION ALL
    SELECT DISTINCT CASE WHEN k.k=0 THEN d.grp ELSE fs.file_id END AS grp,
           fs.file_id, fs.size, k.k, d.lvl+1
      FROM descent d
      JOIN fs_tree fs ON d.file_id=fs.parent_id
      CROSS JOIN generate_series(0,1) k(k))
/* the query */
SELECT grp, file_id, size, k, lvl
  FROM descent
 ORDER BY 1,2,3;

The query uses my table names, but it shouldn’t be difficult to change them. It builds a set of groups, one for each file_id found in fs_tree. To get the desired output, you can do something like:

SELECT grp AS file_id, count(*), sum(size)
  FROM descent GROUP BY 1;
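If the role of k is not obvious, here is a rough Python rendering of what the recursive part does, run on the small fs_tree dump from the sample data. It is a sketch of the idea only: a set stands in for SELECT DISTINCT, and `ROOT_ID`, `descent` and `totals` are names chosen for this illustration.

```python
# fs_tree rows from the sample dump: (file_id, parent_id, size).
fs_tree = [
    (1, None, 0), (2, 1, 0), (3, 2, 936), (4, 2, 706), (5, 1, 4261),
    (6, 1, 4261), (7, 1, 793004), (8, 1, 491), (9, 1, 491),
]
ROOT_ID = 1  # file_id of the scanned root in the dump

def descent(rows, root_id):
    """Return the set of (grp, file_id, size) rows the recursive CTE builds."""
    # Non-recursive term: every direct child of the root starts its own group.
    level = {(fid, fid, size) for fid, parent, size in rows if parent == root_id}
    result = set(level)
    while level:
        nxt = set()  # a set plays the role of SELECT DISTINCT
        for grp, fid, _ in level:
            for child, parent, size in rows:
                if parent == fid:
                    nxt.add((grp, child, size))    # k = 0: stay in the ancestor's group
                    nxt.add((child, child, size))  # k = 1: open a group of its own
        result |= nxt
        level = nxt
    return result

# Same aggregation as the GROUP BY: entries and total size per group.
totals = {}
for grp, _, size in descent(fs_tree, ROOT_ID):
    cnt, total = totals.get(grp, (0, 0))
    totals[grp] = (cnt + 1, total + size)
print(sorted(totals.items()))
```

Each entry ends up in its own group and in the group of every ancestor below the root, so counting rows per grp gives the subtree totals.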

Some notes:

the query will work only if there are no duplicates. I think that is the right way to go, because it is impossible to have 2 equally named entries in a single directory;
the query doesn’t care about the depth or sibling count of the tree, though these do have an impact on performance;
for me it was a good experience, as similar functionality is also needed for task planning systems (I’m working with one at the moment);
where tasks are concerned, a single entry can have multiple parents (but not otherwise) and the query will still work;
this problem can be solved in other ways too, like traversing the tree in ascending order, or using pre-calculated values to avoid the final grouping step, but this is getting a bit bigger than a simple question, so I leave it as an exercise for you.
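As a hint for the ascending variant, the idea could look roughly like this in Python. It is a sketch under assumptions: the data is already aggregated to one entry per folder, the tree has no cycles, and `folders` and `totals_by_ascent` are names invented for the example.

```python
# One entry per folder after aggregation: folder -> (parent folder, file count).
# Counts match the question's sample data; 58091 has no row of its own.
folders = {
    100623: (58091, 21),
    100624: (100623, 14),
}

def totals_by_ascent(folders):
    """Add each folder's file count to itself and to every known ancestor."""
    totals = {folder: 0 for folder in folders}
    for folder, (_, cnt) in folders.items():
        current = folder
        # Climb until the parent is not a known folder. Assumes no cycles.
        while current in folders:
            totals[current] += cnt
            current = folders[current][0]
    return totals

print(totals_by_ascent(folders))
```

On the question's sample data this gives 35 for folder 100623 and 14 for folder 100624, which is the output asked for.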

Recommendations

To get this query to work, you should prepare your data by aggregating it:

WITH RECURSIVE
fs_tree AS (
    SELECT id_folder file_id, id_parrent_folder parent_id,
           sum(size) AS size, count(*) AS cnt
      FROM sample_data GROUP BY 1,2)
,descent AS (
    SELECT fs.file_id grp, fs.file_id, fs.size, fs.cnt, 1 k, 0 AS lvl
      FROM fs_tree fs
     WHERE fs.parent_id = 58091
    UNION ALL
    SELECT DISTINCT CASE WHEN k.k=0 THEN d.grp ELSE fs.file_id END AS grp,
           fs.file_id, fs.size, fs.cnt, k.k, d.lvl+1
      FROM descent d
      JOIN fs_tree fs ON d.file_id=fs.parent_id
      CROSS JOIN generate_series(0,1) k(k))
/* the query */
SELECT grp file_id, sum(size) size, sum(cnt) cnt
  FROM descent
 GROUP BY 1
 ORDER BY 1,2,3;

To speed things up, you can use materialized views and pre-calculate some metrics.

Sample data

Here’s a small dump that will show the data inside the tables:

INSERT INTO fs_file VALUES (1, '/Users/viy/prj/logs', 'D', 0),
    (2, 'jobs', 'D', 1),
    (3, 'pg_csv_load', 'F', 2),
    (4, 'pg_logs', 'F', 2),
    (5, 'logs.sql', 'F', 1),
    (6, 'logs.sql~', 'F', 1),
    (7, 'pgfouine-1.2.tar.gz', 'F', 1),
    (8, 'u.sql', 'F', 1),
    (9, 'u.sql~', 'F', 1);

INSERT INTO fs_tree VALUES (1, NULL, 0),
    (2, 1, 0),
    (3, 2, 936),
    (4, 2, 706),
    (5, 1, 4261),
    (6, 1, 4261),
    (7, 1, 793004),
    (8, 1, 491),
    (9, 1, 491);

Note that I’ve slightly updated the CREATE statements.

And this is the script I’ve used to scan the filesystem:

#!/usr/bin/python

import os
import psycopg2
import sys
from stat import *

def walk_tree(full, parent, level, call_back):
    '''recursively descend the directory tree rooted at full,
       calling the callback function for each entry found'''

    if not os.access(full, os.R_OK):
        return

    for f in os.listdir(full):
        path = os.path.join(full, f)
        if os.path.islink(path):
            # It's a link, register and continue
            e = entry(f, "L", level)
            call_back(parent, e, 0)
            continue

        mode = os.stat(path).st_mode
        if S_ISDIR(mode):
            e = entry(f, "D", level)
            call_back(parent, e, 0)
            # It's a directory, recurse into it
            try:
                walk_tree(path, e, level+1, call_back)
            except OSError:
                pass

        elif S_ISREG(mode):
            # It's a file, call the callback function
            call_back(parent, entry(f, "F", level), os.stat(path).st_size)
        else:
            # It's unknown, just register
            e = entry(f, "U", level)
            call_back(parent, e, 0)

def register(parent, entry, size):
    db_cur.execute("INSERT INTO fs_tree VALUES (%s,%s,%s)",
                   (entry, parent, size))

def entry(name, type, level):
    db_cur.execute("""INSERT INTO fs_file(name,type, level)
                   VALUES (%s, %s, %s) RETURNING file_id""",
                   (name, type, level))
    return db_cur.fetchone()[0]

db_con=psycopg2.connect("dbname=postgres")
db_cur=db_con.cursor()

# Validate the command line; sys.exit prints the message and stops the script
if len(sys.argv) != 2:
    sys.exit("Root directory expected!")

if not S_ISDIR(os.stat(sys.argv[1]).st_mode):
    sys.exit("A directory is wanted!")

e=entry(sys.argv[1], "D", 0)
register(None, e, 0)
walk_tree(sys.argv[1], e, 1, register)

db_con.commit()

db_cur.close()
db_con.close()

This script is for Python 3.2 and is based on the example from the official Python documentation.

Hope this clarifies things for you.
