Editorial Team
Asked: May 24, 2026

We currently have some data on an HDFS cluster on which we generate reports

We currently have some data on an HDFS cluster on which we generate reports using Hive. The infrastructure is being decommissioned, and we need an alternative way to generate the reports from the data (which we imported as tab-separated files into our new environment).

Assume we have a table with the following fields:

  • Query
  • IPAddress
  • LocationCode

The original SQL query we ran on Hive was similar to this (not exactly the same):

select 
COUNT(DISTINCT Query, IPAddress) as c1,
LocationCode as c2, 
Query as c3
from table
group by Query, LocationCode

Could someone provide the most efficient script, using standard Unix/Linux tools such as sort, uniq and awk, that can replace the above query?

Assume the input to the script is a directory of text files. The directory contains about 2000 files. Each file contains an arbitrary number of tab-separated records of the form:

Query <TAB> LocationCode <TAB> IPAddress <NEWLINE>
1 Answer

  1. Editorial Team
    Added an answer on May 24, 2026 at 6:35 pm

    Once you have a sorted file containing all the unique records of the form

    Query <TAB> LocationCode <TAB> IPAddress <NEWLINE>

    you can count the distinct IP addresses for each (Query, LocationCode) pair with:

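    awk -F '\t' '
    # Input is sorted and unique, so each line is one distinct IP for its (Query, LocationCode) pair
    NR > 1 && ($1 != q || $2 != l) { printf "%s\t%s\t%d\n", q, l, count; count = 0 }
    { q = $1; l = $2; count++ }
    # Print the last group, but only if there was any input at all
    END { if (NR) printf "%s\t%s\t%d\n", q, l, count }' sorted_uniq_file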
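
As a small worked example of the counting step, the sketch below runs the same run-counting logic on a made-up, already sorted and deduplicated sample. The queries, location codes and IP addresses are hypothetical test data, not values from the question.

```shell
# Hypothetical sample: already sorted and deduplicated; the tab-separated
# columns are Query, LocationCode, IPAddress.
sample=$(mktemp)
printf 'cats\tUK\t10.0.0.1\n'  > "$sample"
printf 'cats\tUK\t10.0.0.2\n' >> "$sample"
printf 'cats\tUS\t10.0.0.1\n' >> "$sample"
printf 'dogs\tUK\t10.0.0.3\n' >> "$sample"

# Count the lines in each run of identical (Query, LocationCode) pairs.
counts=$(awk -F '\t' '
  NR > 1 && ($1 != q || $2 != l) { printf "%s\t%s\t%d\n", q, l, count; count = 0 }
  { q = $1; l = $2; count++ }
  END { if (NR) printf "%s\t%s\t%d\n", q, l, count }' "$sample")
printf '%s\n' "$counts"
rm -f "$sample"
```

For this sample the output has three lines: cats/UK with a count of 2, cats/US with 1, and dogs/UK with 1.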
    

    The naive way to produce sorted_uniq_file is:

    sort -u dir/* > sorted_uniq_file
    

    But this can take a long time and use a lot of memory.

    A faster and less memory-hungry option is to eliminate duplicates as early as possible: sort each file on its own first, then merge the sorted files. This needs temporary space for the sorted files; the example below uses a directory named sorted:
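
If the sort in use is GNU sort (an assumption, the question only says "standard unix/linux tools"), the single-pass approach can be tuned a little before giving up on it: -S caps the in-memory buffer and -T chooses where spill files are written. Setting LC_ALL=C makes sort compare raw bytes, which is usually faster and keeps lines that share the same Query and LocationCode prefix next to each other. A minimal sketch with made-up demo data and an arbitrary 64M buffer:

```shell
# Assumes GNU sort; the demo directory and its contents are made up.
work=$(mktemp -d)
mkdir "$work/dir" "$work/tmp"
printf 'cats\tUK\t10.0.0.1\ncats\tUK\t10.0.0.1\n' > "$work/dir/part1"
printf 'dogs\tUS\t10.0.0.2\ncats\tUK\t10.0.0.1\n' > "$work/dir/part2"

# -S caps the in-memory buffer, -T says where spill files go,
# LC_ALL=C compares raw bytes instead of using locale collation.
LC_ALL=C sort -u -S 64M -T "$work/tmp" "$work"/dir/* > "$work/sorted_uniq_file"
unique_lines=$(wc -l < "$work/sorted_uniq_file")
echo "$unique_lines unique records"
rm -rf "$work"
```

If LC_ALL=C is used for sorting, the same setting should be used for any later sort -m merge so both steps agree on the ordering.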

    mkdir sorted
    for f in dir/*; do
       # $f includes the dir/ prefix, so strip it for the output name
       sort -u "$f" > "sorted/${f##*/}"
    done
    sort -mu sorted/* > sorted_uniq_file
    rm -rf sorted
    

    If the solution above hits a shell or sort limit (the expansion of dir/* or sorted/*, or the number of arguments sort accepts), merge the files pairwise instead:

    mkdir sorted
    ls dir | while read -r f; do
       sort -u "dir/$f" > "sorted/$f"
    done
    # Merge the files pairwise until only one is left
    while [ "$(ls sorted | wc -l)" -gt 1 ]; do
      mkdir sorted_tmp
      ls sorted | while read -r f1; do
        if read -r f2; then
          sort -mu "sorted/$f1" "sorted/$f2" > "sorted_tmp/$f1"
        else
          # Odd file out: carry it over to the next round unchanged
          mv "sorted/$f1" sorted_tmp
        fi
      done
      rm -rf sorted
      mv sorted_tmp sorted
    done
    mv sorted/* sorted_uniq_file
    rm -rf sorted
    

    The solution above can be optimized to merge more than two files at a time.
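
One way to merge more than two files per round is to let xargs hand sort a batch of names at a time. The sketch below is only an illustration under stated assumptions: merge_sorted is a hypothetical helper name, the batch size of 3 is arbitrary, the file names must not contain whitespace or quotes (xargs splits on them), and the input files must already be sorted under LC_ALL=C.

```shell
# merge_sorted DIR OUT N: merge the already-sorted files in DIR into OUT,
# N files per sort invocation. Hypothetical helper; DIR is consumed.
merge_sorted() {
  dir=$1 out=$2 n=$3
  while [ "$(ls "$dir" | wc -l)" -gt 1 ]; do
    next=$(mktemp -d)   # absolute path, so it is still valid after the cd below
    # xargs passes at most N names per sh call; each merged
    # result is stored under the first name of its batch.
    (cd "$dir" && ls | xargs -n "$n" sh -c '
      dst=$1; shift
      LC_ALL=C sort -mu "$@" > "$dst/$1"' sh "$next")
    rm -rf "$dir"
    mv "$next" "$dir"
  done
  mv "$dir"/* "$out"
  rmdir "$dir"
}

# Demo with made-up data: five sorted files that share one common record.
work=$(mktemp -d)
mkdir "$work/sorted"
for i in 1 2 3 4 5; do
  printf 'q%s\tUK\t10.0.0.%s\nq9\tUS\t10.0.0.9\n' "$i" "$i" > "$work/sorted/part$i"
done
merge_sorted "$work/sorted" "$work/sorted_uniq_file" 3
merged_lines=$(wc -l < "$work/sorted_uniq_file")
echo "$merged_lines unique records"
rm -rf "$work"
```

With a batch size of N, each round shrinks the number of files by roughly a factor of N, so 2000 files need far fewer rounds than with pairwise merging.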
