The Archive Base


Editorial Team
Asked: May 13, 2026 (2026-05-13T00:46:39+00:00)


I naively imagined that I could build a suffix trie where I keep a visit-count for each node, and then the deepest nodes with counts greater than one are the result set I’m looking for.

I have a really really long string (hundreds of megabytes). I have about 1 GB of RAM.

With that little memory, a suffix trie with per-node counts is too space-inefficient to work for me. To quote Wikipedia’s Suffix tree article:

storing a string’s suffix tree typically requires significantly more space than storing the string itself.

The large amount of information in each edge and node makes the suffix tree very expensive, consuming about ten to twenty times the memory size of the source text in good implementations. The suffix array reduces this requirement to a factor of four, and researchers have continued to find smaller indexing structures.

And those were Wikipedia’s comments on the suffix tree, not the trie.

How can I find long repeated sequences in such a large amount of data, and in a reasonable amount of time (e.g. less than an hour on a modern desktop machine)?

(Some Wikipedia links, to avoid people posting them as the ‘answer’: Algorithms on strings and especially Longest repeated substring problem 😉 )


1 Answer

  1. Editorial Team
    Added an answer on May 13, 2026 at 12:46 am

    An effective way to do this is to create an index of the suffixes (one offset per starting position in the text, i.e. a suffix array) and sort it. With a comparison sort this takes O(n lg n) comparisons.

    BWT compression does this step, so it’s a well-understood problem, and there are radix sort and suffix sort implementations (some claiming O(n)) that make it as efficient as possible. It still takes a long time, perhaps several seconds for large texts.

    If you want to use utility code, C++ std::stable_sort() performs much better than std::sort() for natural language (and is much faster than C’s qsort(), but for different reasons).

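    Then visit each entry in sorted order and measure the length of the prefix it shares with its neighbour; the longest shared prefix found is the longest repeated substring. This pass makes O(n) neighbour comparisons.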
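The two steps above (sort an index of suffix offsets, then compare neighbours) can be sketched as follows. This is a minimal illustration, not a tuned implementation: the function names are hypothetical, it assumes C++17 and a text shorter than 4 GiB so that 32-bit offsets suffice, and it uses a plain comparison sort rather than one of the specialised suffix sorters mentioned above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <string_view>
#include <vector>

// Build an index of all suffix start offsets and sort it so that the
// suffixes appear in lexicographic order (a plain suffix array).
// Assumption: text.size() fits in 32 bits, giving 4 bytes per character.
std::vector<std::uint32_t> build_sorted_suffix_index(std::string_view text) {
    std::vector<std::uint32_t> index(text.size());
    std::iota(index.begin(), index.end(), 0u);
    std::stable_sort(index.begin(), index.end(),
                     [text](std::uint32_t a, std::uint32_t b) {
                         // Compare the suffixes starting at a and b.
                         return text.substr(a) < text.substr(b);
                     });
    return index;
}

// Length of the prefix shared by two strings.
std::size_t common_prefix_length(std::string_view a, std::string_view b) {
    const std::size_t limit = std::min(a.size(), b.size());
    std::size_t n = 0;
    while (n < limit && a[n] == b[n]) {
        ++n;
    }
    return n;
}

// Scan neighbouring suffixes in sorted order and keep the longest shared
// prefix. The result is a view into `text`; occurrences may overlap.
std::string_view longest_repeated_substring(std::string_view text) {
    const std::vector<std::uint32_t> index = build_sorted_suffix_index(text);
    std::size_t best_start = 0;
    std::size_t best_len = 0;
    for (std::size_t i = 1; i < index.size(); ++i) {
        const std::size_t len = common_prefix_length(
            text.substr(index[i - 1]), text.substr(index[i]));
        if (len > best_len) {
            best_len = len;
            best_start = index[i];
        }
    }
    return text.substr(best_start, best_len);
}
```

For example, longest_repeated_substring("banana") yields "ana". The index costs four bytes per character on top of the text itself, and std::stable_sort may allocate a temporary buffer as well, so this should be checked against the memory budget. On highly repetitive input each suffix comparison can run long, which is where the specialised suffix sorters pay off.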



© 2021 The Archive Base. All Rights Reserved