I have a program that is parsting tweets in real time from the twitter

Question

0

Asked: June 9, 20262026-06-09T20:29:19+00:00 2026-06-09T20:29:19+00:00

I have a program that is parsting tweets in real time from the twitter

0

I have a program that is parsting tweets in real time from the twitter stream api. Before storing them, I am encoding them as utf8. Certain characters end up appearing in the string as ?, ??, or ??? instead of their respective unicode codes and cause problems. Upon further investigation, I found that the problematic characters are from the “emoticon” block, U+1F600 – U+1F64F, and the “Miscellaneous Symbols And Pictographs” block, U+1F300 – U+1F5FF. I tried removing, but was unsuccessful as the matcher ended up replacing almost every character in the string, not just my desired unicode range.

String utf8tweet = "";
        try {
            byte[] utf8Bytes = status.getText().getBytes("UTF-8");

            utf8tweet = new String(utf8Bytes, "UTF-8");

        } 
        catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
Pattern unicodeOutliers = Pattern.compile("[\\u1f300-\\u1f64f]", Pattern.UNICODE_CASE | Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE);
Matcher unicodeOutlierMatcher = unicodeOutliers.matcher(utf8tweet);
utf8tweet = unicodeOutlierMatcher.replaceAll(" ");

What can I do to remove these characters?

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-09T20:29:20+00:00

In the regex pattern add the negation operator ^. For filtering printable characters you could use the following expression [^\\x00-\\x7F] and you should get the desired result.

import java.io.UnsupportedEncodingException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UTF8 {
    public static void main(String[] args) {
        String utf8tweet = "";
        try {
            byte[] utf8Bytes = "#Hello twitter  How are you?".getBytes("UTF-8");

            utf8tweet = new String(utf8Bytes, "UTF-8");

        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        Pattern unicodeOutliers = Pattern.compile("[^\\x00-\\x7F]",
                Pattern.UNICODE_CASE | Pattern.CANON_EQ
                        | Pattern.CASE_INSENSITIVE);
        Matcher unicodeOutlierMatcher = unicodeOutliers.matcher(utf8tweet);

        System.out.println("Before: " + utf8tweet);
        utf8tweet = unicodeOutlierMatcher.replaceAll(" ");
        System.out.println("After: " + utf8tweet);
    }
}

Results in the following output:

Before: #Hello twitter  How are you?
After: #Hello twitter   How are you?

EDIT

To explain further, you could also keep expressing the range with the \u form in the following way [^\\u0000-\\u007F], which will match all the characters which are not the first 128 UNICODE characters (the same as before). If you want to extend the range to support extra characters, you can do so using the UNICODE character list here.

For example if you want to include vowels with accent (used in Spanish) you should extend the range to \u00FF, so you have [^\\u0000-\\u00FF] or [^\\x00-\\xFF]:

Before: #Hello twitter  How are you? á é í ó ú
After: #Hello twitter   How are you? á é í ó ú

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I have a program that is parsting tweets in real time from the twitter

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply