Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 7502683
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 29, 20262026-05-29T20:57:21+00:00 2026-05-29T20:57:21+00:00

I have a system that tracks what documents users view. Each document has its

  • 0

I have a system that tracks what documents users view. Each document has its ID and a cluster that it belongs to. My system tracks the session ID and the number of views. I would now like to construct an SQL query which would give me two columns – the session ID and the classified cluster. The algorithm for classification is simple:

1. select all sessions
2. for each session S
   I. prepare an accumulator ACC for clusters
   II. select the clusters of viewed documents for this session
   III. for each cluster C accumulate the cluster count ( ACC[C]++ )
   IV. find the maximum in the ACC. That is the cluster that the session was classified to

The table structures are as follows, I’m using MySQL 5.5.16:

Session

+-------+-----------+--------------------+
| ID    | sessionID | classified_cluster |
+-------+-----------+--------------------+

SessionDocument

+-------+-----------+------------+
| ID    | sessionID | documentID |
+-------+-----------+------------+

Cluster

+-------+-------+
| ID    | label |
+-------+-------+

ClusterDocument

+-------+-----------+------------+
| ID    | clusterID | documentID |
+-------+-----------+------------+

So basically, I want to select the clusters for each session, count the occurrence of each cluster for viewed documents and find the maximum occurrence. Then the ID of the cluster that occurred the most, is the result for the session therefore the final result set holds the session ID and the most occurred cluster:

Result

+-----------+-----------------------+
| sessionID | classifiedIntoCluster |
+-----------+-----------------------+

I managed to get the clusters of viewed documents for each session (step 2/II.) with this query:

SELECT SD.session_id, CD.cluster_id 
FROM cluster_document AS CD 
INNER JOIN session_document AS SD 
ON CD.document_id = SD.document_id
WHERE session_id IN (SELECT session_id FROM session) 

I’m having trouble figuring out the rest. Is this even possible with nested SELECT queries? Should I use a cursor, and if yes, could someone show an example with a cursor? Any help will be much appreciated.

EDIT #1: added a C# implementation, MySQL dump and expected result

C# implementation

    private void ClassifyUsers() {
        int nClusters = Database.SelectClusterCount(); //get number of clusters
        DataSet sessions = Database.SelectSessions(); //get all sessions
        foreach (DataRow session in sessions.Tables[0].Rows) { //foreach session
           int[] acc = new int[nClusters]; //prepare an accumulator for each known cluster
           string s_id = session["session_id"].ToString();
           DataSet sessionClusters = Database.SelectSessionClusters(s_id); //get clusters for this session

           foreach (DataRow cluster in sessionClusters.Tables[0].Rows) { //for each cluster
               int c = Convert.ToInt32(cluster["cluster_id"].ToString()) - 1;
               acc[c]++; //accumulate the cluster count
           }

           //find the maximum in the accumulator -> that is the most relevant cluster
           int max = 0;
           for (int j = 0; j < acc.Length; j++) {
               if (acc[j] >= acc[max]) max = j;
           }
           max++;
           Database.UpdateSessionCluster(s_id, max); //update the session with its new assigned cluster
       }
    }

Table structure, test data and expected result

Table structure and test data

Expected result

EDIT #2: added a smaller data set and further algorithm walkthrough

Here is a smaller data set:

SESSION

session id    |  cluster
abc                 0
def                 0
ghi                 0
jkl                 0       
mno                 0

CLUSTER

cluster_id  | label
1               A
2               B
3               C
4               D
5               E

SESSION_DOCUMENT

id      | session_id    |   document_id
1           abc             1
2           def             5
3           jkl             3
4           ghi             4
5           mno             2
6           def             2
7           abc             5
8           ghi             3

CLUSTER_DOCUMENT

id      | cluster_id    |   document_id
1           1                  2
2           1                  3
3           2                  5
4           3                  5
5           3                  1
6           4                  3
7           5                  2
8           5                  4

Algorithm in detail

Step 1: get clusters for documents viewed by the session

session_id  |  cluster_id   | label     | document_id   
abc             3               C           1
abc             2               B           5
abc             3               C           5
-----
def             2               B           5
def             3               C           5   
def             1               A           2
def             5               E           2   
----
ghi             5               E           4   
ghi             1               A           3   
ghi             4               D           3   
----
jkl             1               A           3   
jkl             4               D           3   
----
mno             1               A           2
mno             5               E           2

Step 2: count occurrence of clusters

session_id |    cluster_id  | label |   occurrence
abc             3               C           2   <--- MAX
abc             2               B           1
----
def             2               B           1
def             3               C           1   
def             1               A           1
def             5               E           1   <--- MAX
----
ghi             5               E           1   
ghi             1               A           1   
ghi             4               D           1   <--- MAX
----
jkl             1               A           1   
jkl             4               D           1   <--- MAX
----
mno             1               A           1   
mno             5               E           1   <--- MAX

Step 3 (end result): find maximum occurred cluster for each session (see above) and construct the final result set (session_id, cluster_id):

session_id |    cluster_id  
abc                 3           
def                 5
ghi                 4
jkl                 4
mno                 5

EDIT #3: Accepted answer clarification

Both given answers are correct. They both provide a solution for the problem. I gave Mosty Mostacho the accepted answer because he delivered the solution first and provided another version of the solution with a VIEW. The solution from mankuTimma is of the same quality as Mosty Mostacho’s solution. Therefore, we have two equally good solutions, I just picked Mosty Mostacho because he was first.

Thanks to both of them for their contributions.
.

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-29T20:57:23+00:00Added an answer on May 29, 2026 at 8:57 pm

    Well I have some doubts about how to choose the an occurrence when there are many equal, but looking at the C# code it seems this selection is non-deterministic.

    Now, given the sampla data Step 2 actually results in:

    +------------+------------+-------+------------+
    | SESSION_ID | CLUSTER_ID | LABEL | OCCURRENCE |
    +------------+------------+-------+------------+
    | abc        |          3 | C     |          2 |
    | def        |          1 | A     |          1 |
    | def        |          2 | B     |          1 |
    | def        |          3 | C     |          1 |
    | def        |          5 | E     |          1 |
    | ghi        |          1 | A     |          1 |
    | ghi        |          4 | D     |          1 |
    | ghi        |          5 | E     |          1 |
    | jkl        |          1 | A     |          1 |
    | jkl        |          4 | D     |          1 |
    | mno        |          1 | A     |          1 |
    | mno        |          5 | E     |          1 |
    +------------+------------+-------+------------+
    

    So, continuing with this data, I get the session_id and the max(cluster_id) for that session id, resulting in:

    +------------+------------+
    | SESSION_ID | CLUSTER_ID |
    +------------+------------+
    | abc        |          3 |
    | def        |          5 |
    | ghi        |          5 |
    | jkl        |          4 |
    | mno        |          5 |
    +------------+------------+
    

    The max(cluster_id) is just there to perform that non-deterministic selection. This is the query:

    select s1.session_id, max(s1.cluster_id) as cluster_id from (
      select sd.session_id, cd.cluster_id, count(*) as Occurrence
      from session_document sd
      join cluster_document cd
      on sd.document_id = cd.document_id
      join cluster c
      on c.cluster_id = cd.cluster_id
      group by sd.session_id, cd.cluster_id, c.label
    ) as s1
    left join (
      select sd.session_id, count(*) as Occurrence
      from session_document sd
      join cluster_document cd
      on sd.document_id = cd.document_id
      join cluster c
      on c.cluster_id = cd.cluster_id
      group by sd.session_id, cd.cluster_id, c.label
    ) as s2
    on s1.session_id = s2.session_id and s1.occurrence < s2.occurrence
    where s2.occurrence is null
    group by s1.session_id
    

    Maybe adding a view would improve performance (replacement of above query):

    create view MaxOccurrences as (
      select sd.session_id, cd.cluster_id, count(*) as Occurrence
      from session_document sd
      join cluster_document cd
      on sd.document_id = cd.document_id
      join cluster c
      on c.cluster_id = cd.cluster_id
      group by sd.session_id, cd.cluster_id, c.label
    );
    
    select s1.session_id, max(s1.cluster_id) as cluster_id
    from MaxOccurrences as s1
    left join MaxOccurrences as s2
    on s1.session_id = s2.session_id and s1.occurrence < s2.occurrence
    where s2.occurrence is null
    group by s1.session_id
    

    Let me know if it works.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

assuming that you have the following sentence in a requirements document The system has
We have a system that has a Mysql database with about 2Gigs of data.
We have system here that uses Java JNI to call a function in a
I have a system that combines the best and worst of Java and PHP.
I have a system that uses a meta refresh to a logout page, which
I have a system that creates an order and that order can be billed
I have a system that compares fiscal year up to the current date to
I have a system that takes dynamic data, puts it in HTML for layout
I have big system that make my system crash hard. When I boot up,
I have a system that runs like this: main.exe runs sub.exe runs sub2.exe and

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.