We currently have some data on an HDFS cluster on which we generate reports

Question

0

Asked: May 24, 20262026-05-24T18:35:03+00:00 2026-05-24T18:35:03+00:00

We currently have some data on an HDFS cluster on which we generate reports

0

We currently have some data on an HDFS cluster on which we generate reports using Hive. The infrastructure is in the process of being decommissioned and we are left with the task of coming up with an alternative of generating the report on the data (which we imported as tab separated files into our new environment)

Assuming we have a table with the following fields.

Query
IPAddress
LocationCode

Our original SQL query we used to run on Hive was (well not exactly.. but something similar)

select 
COUNT(DISTINCT Query, IPAddress) as c1,
LocationCode as c2, 
Query as c3
from table
group by Query, LocationCode

I was wondering if someone could provide me with an the most efficient script using standard unix/linux tools such as sort, uniq and awk which can act as a replacement for the above query.

Assume the input to the script would be a directory of text files. the dir would contain about 2000 files. Each file would contain arbitrary number of tab separated records of the form :

Query <TAB> LocationCode <TAB> IPAddress <NEWLINE>

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-24T18:35:04+00:00

Once you have a sorted file containing all the unique

Query <TAB> LocationCode <TAB> IPAddress <NEWLINE>

you could:

awk -F '\t' 'NR == 1 {q=$1; l=$2; count=0}
q == $1 && l == $2{count++}
q != $1 || l != $2{printf "%s\t%s\t%d\n", q, l, count; q=$1; l=$2; count=1}
END{printf "%s\t%s\t%d\n", q, l, count}' sorted_uniq_file

To get this sorted_uniq_file the naive way can be:

sort -u dir/* > sorted_uniq_file

But this can be very long and memory consuming.

A faster option (and less memory consuming) could be to eliminate duplicate as soon as possible, sorting first and merging later. This needs a temporary space for the sorted file, let use a directory named sorted:

mkdir sorted;
for f in dir/*; do
   sort -u $f > sorted/$f
done
sort -mu sorted/* > sorted_uniq_file
rm -rf sorted

If the solution above hit some shell or sort limit (expansion of dir/*, or of sorted/*, or number of parameters of sort):

mkdir sorted;
ls dir | while read f; do
   sort -u dir/$f > sorted/$f
done
while [ `ls sorted | wc -l` -gt 1 ]; do
  mkdir sorted_tmp
  ls sorted | while read f1; do
    if read f2; then
      sort -mu sorted/$f1 sorted/$f2 > sorted_tmp/$f1
    else
      mv sorted/$f1 sorted_tmp
    fi
  done
  rm -rf sorted
  mv sorted_tmp sorted
done
mv sorted/* sorted_uniq_file
rm -rf sorted

The solution above can be optimized to merge more that 2 files at the same time.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

We currently have some data on an HDFS cluster on which we generate reports

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply