I need to take a string, and shorten it to 140 characters. Currently I

Question

0

Editorial Team

Asked: May 12, 20262026-05-12T22:30:33+00:00 2026-05-12T22:30:33+00:00

I need to take a string, and shorten it to 140 characters. Currently I

0

I need to take a string, and shorten it to 140 characters.

Currently I am doing:

if len(tweet) > 140:
    tweet = re.sub(r"\s+", " ", tweet) #normalize space
    footer = "… " + utils.shorten_urls(post['url'])
    avail = 140 - len(footer)
    words = tweet.split()
    result = ""
    for word in words:
        word += " "
        if len(word) > avail:
            break
        result += word
        avail -= len(word)
    tweet = (result + footer).strip()
    assert len(tweet) <= 140

So this works great for English, and English like strings, but fails for a Chinese string because tweet.split() just returns one array:

>>> s = u"简讯：新華社報道，美國總統奧巴馬乘坐的「空軍一號」專機晚上10時42分進入上海空域，預計約30分鐘後抵達浦東國際機場，開展他上任後首次訪華之旅。"
>>> s
u'\u7b80\u8baf\uff1a\u65b0\u83ef\u793e\u5831\u9053\uff0c\u7f8e\u570b\u7e3d\u7d71\u5967\u5df4\u99ac\u4e58\u5750\u7684\u300c\u7a7a\u8ecd\u4e00\u865f\u300d\u5c08\u6a5f\u665a\u4e0a10\u664242\u5206\u9032\u5165\u4e0a\u6d77\u7a7a\u57df\uff0c\u9810\u8a08\u7d0430\u5206\u9418\u5f8c\u62b5\u9054\u6d66\u6771\u570b\u969b\u6a5f\u5834\uff0c\u958b\u5c55\u4ed6\u4e0a\u4efb\u5f8c\u9996\u6b21\u8a2a\u83ef\u4e4b\u65c5\u3002'
>>> s.split()
[u'\u7b80\u8baf\uff1a\u65b0\u83ef\u793e\u5831\u9053\uff0c\u7f8e\u570b\u7e3d\u7d71\u5967\u5df4\u99ac\u4e58\u5750\u7684\u300c\u7a7a\u8ecd\u4e00\u865f\u300d\u5c08\u6a5f\u665a\u4e0a10\u664242\u5206\u9032\u5165\u4e0a\u6d77\u7a7a\u57df\uff0c\u9810\u8a08\u7d0430\u5206\u9418\u5f8c\u62b5\u9054\u6d66\u6771\u570b\u969b\u6a5f\u5834\uff0c\u958b\u5c55\u4ed6\u4e0a\u4efb\u5f8c\u9996\u6b21\u8a2a\u83ef\u4e4b\u65c5\u3002']

How should I do this so it handles I18N? Does this make sense in all languages?

I’m on python 2.5.4 if that matters.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-12T22:30:34+00:00

After speaking with some native Cantonese, Mandarin, and Japanese speakers it seems that the correct thing to do is hard, but my current algorithm still makes sense to them in the context of internet posts.

Meaning, they are used to the “split on space and add … at the end” treatment.

So I’m going to be lazy and stick with it, until I get complaints from people that don’t understand it.

The only change to my original implementation would be to not force a space on the last word since it is unneeded in any language (and use the unicode character … &#x2026 instead of … three dots to save 2 characters)

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I need to take a string, and shorten it to 140 characters. Currently I

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply