Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • Home
  • SEARCH
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 564783
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 13, 20262026-05-13T12:48:11+00:00 2026-05-13T12:48:11+00:00

I want to know the effective algorithms/data structures to identify the below information in

  • 0

I want to know the effective algorithms/data structures to identify the below information in streaming data.

Consider a real-time streaming data like twitter. I am mainly interested in the below queries rather than storing the actual data.

I need my queries to run on actual data but not any of the duplicates.

As I am not interested in storing the complete data, it will be difficult for me to identify the duplicate posts. However, I can hash all the posts and check against them. But I would like to identify near duplicate posts also. How can I achieve this.

Identify the top k topics being discussed by the users.

I want to identify the top topics being discussed by users. I don’t want the top frequency words as shown by twitter. Instead I want to give some high level topic name of the most frequent words.

I would like my system to be real-time. I mean, my system should be able to handle any amount of traffic.

I can think of map reduce approach but I am not sure how to handle synchronization issues. For example, duplicate posts can reach different nodes and both of them could store them in the index.

In a typical news source, one will be removing any stop words in the data. In my system I would like to update my stop words list by identifying top frequent words across a wide range of topics.

What will be effective algorithm/data structure to achieve this.

I would like to store the topics over a period of time to retrieve interesting patterns in the data. Say, friday evening everyone wants to go to a movie. what will be the efficient way to store this data.

I am thinking of storing it in hadoop distributed file system, but over a period of time, these indexes become so large that I/O will be my major bottleneck.

Consider multi-lingual data from tweets around the world. How can I identify similar topics being discussed across a geographical area?

There are 2 problems here. One is identifying the language being used. It can be identified based on the person tweeting. But this information might affect the privacy of the users. Other idea, could be running it through a training algorithm. What is the best method currently followed for this. Other problem is actually looking up the word in a dictionary and associating it to common intermediate language like say english. How to take care of word sense disambiguation like a same word being used in different contests.

Identify the word boundaries

One possibility is to use some kind of training algorithm. But what is the best approach followed. This is some way similar to word sense disambiguation, because you will be able to identify word boundaries based on the actual sentence.

I am thinking of developing a prototype and evaluating the system rather than the concrete implementation. I think its not possible to scrap the real-time twitter data. I am thinking this approach can be tested on some data freely available online. Any ideas, where I can get this data.

Your feedback is appreciated.

Thanks for your time.

— Bala

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-13T12:48:11+00:00Added an answer on May 13, 2026 at 12:48 pm

    There are a couple different questions buried in here. I can’t understand all that you’re asking, but here’s a the big one as I understand it: You want to categorize messages by topic. You also want to remove duplicates.

    Removing duplicates is (relatively) easy. To remove “near” duplicates, you could first remove uninteresting parts from your data. You could start by removing capitalization and punctuation. You could also remove the most common words. Then you could add the resulting message to a Bloom filter. Hashing isn’t good enough for Twitter, as the hashed messages wouldn’t be much smaller than the full messages. You’d end up with a hash that doesn’t fit in memory. That’s why you’d use a Bloom filter instead. It might have to be a very large Bloom filter, but it will still be smaller than the hash table.

    The other part is a difficult categorization problem. You probably do not want to write this part yourself. There are a number of libraries and programs available for categorization, but it might be hard to find one that fits your needs. An example is the Vowpal Wabbit project, which is a fast online algorithm for categorization. However, it only works on one category at a time. For multiple categories, you would have to run multiple copies and train them separately.

    Identifying the language sounds less difficult. Don’t try to do something smart like “training”, instead put the most common words from each language in a dictionary. For each message, use the language whose words appeared most frequently.

    If you want the algorithm to come up with categories on its own, good luck.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

i want to know if there is a way to get the previous directory
I have the following C# code fragment: using System; class count { public static
By way of explanation, take this value type in C#: struct ObjRef { public
We currently use PDForm to grab a blank pdf file (no values, just form
I'm just looking for an answer on the net but I haven't reached any
HI All I was hoping someone could help me improve a query I have
You can actually just skip this first part cause it gets confusing, but I'll
Is the Visitor Pattern the fastest way to accomplish method parameter type identification (effectively
Ok I admit, what I need is not a tabpanel, (or is it?). But
I've compiled an MTS video into MP4 format using FFMPEG. The video is 2

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.