Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • Home
  • SEARCH
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 8283535
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 8, 20262026-06-08T10:46:03+00:00 2026-06-08T10:46:03+00:00

Input: tableapplechairtablecupboard… many words What would be an efficient algorithm to split such text

  • 0

Input: "tableapplechairtablecupboard..." many words

What would be an efficient algorithm to split such text to the list of words and get:

Output: ["table", "apple", "chair", "table", ["cupboard", ["cup", "board"]], ...]

First thing that cames to mind is to go through all possible words (starting with first letter) and find the longest word possible, continue from position=word_position+len(word)

P.S.
We have a list of all possible words.
Word “cupboard” can be “cup” and “board”, select longest.
Language: python, but main thing is the algorithm itself.

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-08T10:46:04+00:00Added an answer on June 8, 2026 at 10:46 am

    A naive algorithm won’t give good results when applied to real-world data. Here is a 20-line algorithm that exploits relative word frequency to give accurate results for real-word text.

    (If you want an answer to your original question which does not use word frequency, you need to refine what exactly is meant by “longest word”: is it better to have a 20-letter word and ten 3-letter words, or is it better to have five 10-letter words? Once you settle on a precise definition, you just have to change the line defining wordcost to reflect the intended meaning.)

    The idea

    The best way to proceed is to model the distribution of the output. A good first approximation is to assume all words are independently distributed. Then you only need to know the relative frequency of all words. It is reasonable to assume that they follow Zipf’s law, that is the word with rank n in the list of words has probability roughly 1/(n log N) where N is the number of words in the dictionary.

    Once you have fixed the model, you can use dynamic programming to infer the position of the spaces. The most likely sentence is the one that maximizes the product of the probability of each individual word, and it’s easy to compute it with dynamic programming. Instead of directly using the probability we use a cost defined as the logarithm of the inverse of the probability to avoid overflows.

    The code

    from math import log
    
    # Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
    words = open("words-by-frequency.txt").read().split()
    wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
    maxword = max(len(x) for x in words)
    
    def infer_spaces(s):
        """Uses dynamic programming to infer the location of spaces in a string
        without spaces."""
    
        # Find the best match for the i first characters, assuming cost has
        # been built for the i-1 first characters.
        # Returns a pair (match_cost, match_length).
        def best_match(i):
            candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
            return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)
    
        # Build the cost array.
        cost = [0]
        for i in range(1,len(s)+1):
            c,k = best_match(i)
            cost.append(c)
    
        # Backtrack to recover the minimal-cost string.
        out = []
        i = len(s)
        while i>0:
            c,k = best_match(i)
            assert c == cost[i]
            out.append(s[i-k:i])
            i -= k
    
        return " ".join(reversed(out))
    

    which you can use with

    s = 'thumbgreenappleactiveassignmentweeklymetaphor'
    print(infer_spaces(s))
    

    The results

    I am using this quick-and-dirty 125k-word dictionary I put together from a small subset of Wikipedia.

    Before: thumbgreenappleactiveassignmentweeklymetaphor.
    After: thumb green apple active assignment weekly metaphor.

    Before: thereismassesoftextinformationofpeoplescommentswhichisparsedfromhtmlbuttherearen
    odelimitedcharactersinthemforexamplethumbgreenappleactiveassignmentweeklymetapho
    rapparentlytherearethumbgreenappleetcinthestringialsohavealargedictionarytoquery
    whetherthewordisreasonablesowhatsthefastestwayofextractionthxalot.

    After: there is masses of text information of peoples comments which is parsed from html but there are no delimited characters in them for example thumb green apple active assignment weekly metaphor apparently there are thumb green apple etc in the string i also have a large dictionary to query whether the word is reasonable so what s the fastest way of extraction thx a lot.

    Before: itwasadarkandstormynighttherainfellintorrentsexceptatoccasionalintervalswhenitwascheckedbyaviolentgustofwindwhichsweptupthestreetsforitisinlondonthatoursceneliesrattlingalongthehousetopsandfiercelyagitatingthescantyflameofthelampsthatstruggledagainstthedarkness.

    After: it was a dark and stormy night the rain fell in torrents except at occasional intervals when it was checked by a violent gust of wind which swept up the streets for it is in london that our scene lies rattling along the housetops and fiercely agitating the scanty flame of the lamps that struggled against the darkness.

    As you can see it is essentially flawless. The most important part is to make sure your word list was trained to a corpus similar to what you will actually encounter, otherwise the results will be very bad.


    Optimization

    The implementation consumes a linear amount of time and memory, so it is reasonably efficient. If you need further speedups, you can build a suffix tree from the word list to reduce the size of the set of candidates.

    If you need to process a very large consecutive string it would be reasonable to split the string to avoid excessive memory usage. For example you could process the text in blocks of 10000 characters plus a margin of 1000 characters on either side to avoid boundary effects. This will keep memory usage to a minimum and will have almost certainly no effect on the quality.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

Input: var text = $('#id_textarea').val(); var keywords = $('#id_keywords').val().split(','); Output: Number of keywords in
Input: I have a LaTeX file, with plain text & math formulas. Desired output:
Input: var str = 'text {{value1}} text {{value2}}text{{value1}}'; Regex: var result = str.match(/({{\S+}})+/ig); Output:
Input: Dynamic Text (PDF 120 KB) Output: Dynamic Text
Input: two lists (list1, list2) of equal length Output: one list (result) of the
Input >> list = [[1,2,3], [6], [3,4,5,6]] Output >> [1,2,3,3,4,5,6,6] I want to know
Input: List of keys: [ :name :address :work] Map 1: { :name A :address
Input : {5, 13, 6, 5, 13, 7, 8, 6, 5} Output : {5,
Input: SHC 111U,SHB 22x,, SHA 5555G Needed output: SHB 22X, SHC 111U, SHA 5555G
Input col1, col2, col3, coln Output @col1, @col2, @col3, @coln

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.