I do have a table with list of files. There is id_folder, id_parrent_folder, size (file size):
create table sample_data (
id_folder bigint ,
id_parrent_folder bigint,
size bigint
);
I would like to know, how many files are in every subfolder (including current folder) for each folder (starting wigh given folder). Given the samle data posted below I expect the following output:
id_folder files
100623 35
100624 14
Sample data:
insert into sample_data values (100623,58091,60928);
insert into sample_data values (100623,58091,59904);
insert into sample_data values (100623,58091,54784);
insert into sample_data values (100623,58091,65024);
insert into sample_data values (100623,58091,25600);
insert into sample_data values (100623,58091,31744);
insert into sample_data values (100623,58091,27648);
insert into sample_data values (100623,58091,39424);
insert into sample_data values (100623,58091,30720);
insert into sample_data values (100623,58091,71168);
insert into sample_data values (100623,58091,68608);
insert into sample_data values (100623,58091,34304);
insert into sample_data values (100623,58091,46592);
insert into sample_data values (100623,58091,35328);
insert into sample_data values (100623,58091,29184);
insert into sample_data values (100623,58091,38912);
insert into sample_data values (100623,58091,38400);
insert into sample_data values (100623,58091,49152);
insert into sample_data values (100623,58091,14444);
insert into sample_data values (100623,58091,33792);
insert into sample_data values (100623,58091,14789);
insert into sample_data values (100624,100623,16873);
insert into sample_data values (100624,100623,32768);
insert into sample_data values (100624,100623,104920);
insert into sample_data values (100624,100623,105648);
insert into sample_data values (100624,100623,31744);
insert into sample_data values (100624,100623,16431);
insert into sample_data values (100624,100623,46592);
insert into sample_data values (100624,100623,28160);
insert into sample_data values (100624,100623,58650);
insert into sample_data values (100624,100623,162);
insert into sample_data values (100624,100623,162);
insert into sample_data values (100624,100623,162);
insert into sample_data values (100624,100623,162);
insert into sample_data values (100624,100623,162);
I’ve tried to use example from postgresql (postgresql docs), but it (obviously) can’t work this way. Any help appreciated.
— Edit
I’ve tried the following query:
WITH RECURSIVE included_files(id_folder, parrent_folder, dist_last_change) AS (
SELECT
id_folder,
id_parrent_folder,
size
FROM
sample_data p
WHERE
id_folder = 100623
UNION ALL
SELECT
p.id_folder,
p.id_parrent_folder,
p.size
FROM
included_files if,
sample_data p
WHERE
p.id_parrent_folder = if.id_folder
)
select * from included_files
This won’t work, because for every child there is a lot of parents and as a result rows in child folders are multiplied.
Very nice problem to think about, I upvoted!
As I see it, 2 cases to think about:
So far I’ve came up with the following query:
I’ve added the following rows to the
sample_data:This query is not optimal though and will be slowing down as number of rows grows.
Full-scale tests
In order to simulate original situation, I’ve done a small python script that scans filesystem and stores it into the database (thus the delay, I’m not yet good at python scripting).
The following tables had been created:
Scanning whole filesystem of my MBP took 7.5 minutes and I have 870k entries in the
fs_treetable, which is quite similar to the original task. After upload, the following was run:I’ve tried running my first query on this data and had to kill it after aprx 1 hour. The improved one takes round 2 minutes (on my MBP) to do the job on the whole filesystem. Here it comes:
Query uses my table names, but it shouldn’t be difficult to change it. It will build a set of groups for each
file_idfound in thefs_tree. To get the desired output, you can do something like:Some notes:
Recommendations
To get this query work, you should prepare your data by aggregating it:
In order to speed things up, you can implement Materialized Views and pre-calculate some metrics.
Sample data
Here’s a small dump that will show the data inside the tables:
Note, that I’ve slightly updated create statements.
And this is the script I’ve used to scan the filesystem:
This script is for Python 3.2 and is based on the example from official python documentation.
Hope this clarifies things for you.