The Archive Base Latest Questions

Editorial Team
Asked: May 15, 2026 (2026-05-15T12:04:35+00:00)

This is a double question for you amazingly kind Stacked Overflow Wizards out there.


  1. How do I set emacs/slime/swank to use UTF-8 when talking with Clojure, or use UTF-8 at the command-line REPL? At the moment I cannot send any non-roman characters to swank-clojure, and using the command-line REPL garbles things.

  2. It’s really easy to do regular expressions on latin text:

    (re-seq #"[\w]+" "It's really true that Japanese sentences don't need spaces?")

But what if I had some Japanese? I thought that this would work, but I can’t test it:

(re-seq #"[(?u)\w]+" "日本語 の 文章 に は スペース が 必要 ない って、 本当?")

It gets harder if we have to use a dictionary to find word breaks, or to find a katakana-only word ourselves:

(re-seq #"[アイウエオ-ン]" "日本語の文章にはスペースが必要ないって、本当?")

Thanks!

1 Answer

    Editorial Team
    Added an answer on May 15, 2026 at 12:04 pm

    Can’t help with swank or Emacs, I’m afraid. I’m using Enclojure on NetBeans and it works well there.

    On matching: as Alex said, by default \w in Java regexes matches only ASCII letters, digits, and the underscore ([a-zA-Z_0-9]), so it fails on non-English characters, even the accented Latin letters used in Western Europe:

    (re-seq #"\w+" "prøve")  => ("pr" "ve")  ; Norwegian
    (re-seq #"\w+" "mañana") => ("ma" "ana") ; Spanish
    (re-seq #"\w+" "große")  => ("gro" "e")  ; German
    (re-seq #"\w+" "plaît")  => ("pla" "t")  ; French
    

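    \w skips the non-ASCII characters and splits each word around them. Using [(?u)\w]+ instead makes no difference: inside a character class, (?u) is read as four literal characters rather than a flag, and even as a proper flag it only makes case-insensitive matching Unicode-aware. The same goes for the Japanese.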

    But see this regex reference: \p{L} matches any Unicode character in category Letter, so it actually works for Norwegian

    (re-seq #"\p{L}+" "prøve")
    => ("prøve")
    

    as well as for Japanese (at least I suppose so, I can’t read it but it seems to be in the ballpark):

    (re-seq #"\p{L}+" "日本語 の 文章 に は スペース が 必要 ない って、 本当?")
    => ("日本語" "の" "文章" "に" "は" "スペース" "が" "必要" "ない" "って" "本当")
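
If you would rather keep \w, one option is the (?U) flag (capital U), which turns on Pattern's UNICODE_CHARACTER_CLASS mode so that \w follows the Unicode notion of a word character. This is only a sketch and it assumes Java 7 or later; the names unicode-words and words are made up for illustration.

```clojure
;; (?U) enables java.util.regex.Pattern/UNICODE_CHARACTER_CLASS (Java 7+),
;; so \w matches Unicode letters and digits, not just [a-zA-Z_0-9].
(def unicode-words #"(?U)\w+")

(defn words
  "Returns the runs of Unicode word characters in s as a vector.
  Hypothetical helper, not part of any library."
  [s]
  (vec (re-seq unicode-words s)))

(words "prøve")          ; => ["prøve"]
(words "mañana, große")  ; => ["mañana" "große"]
```

Unlike \p{L}, this version also accepts digits and underscores, which may or may not be what you want when splitting words.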
    

    There are lots of other options, like matching on combining diacritical marks and whatnot; check out the reference.
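
Two of those options, sketched roughly: \p{M} (the Mark category) keeps decomposed accents attached to their letters, and \p{InKatakana} (the Katakana block) picks out katakana-only runs like the one asked about in the question. The var names below are mine, chosen for illustration only.

```clojure
;; \p{M} is the Unicode Mark category (combining accents and the like).
;; Including it keeps decomposed text such as n + U+0303 in one piece.
(def letters-and-marks #"[\p{L}\p{M}]+")

;; \p{InKatakana} is the Katakana block (U+30A0 to U+30FF), which also
;; contains the long-vowel mark used in words like スペース.
(def katakana-run #"\p{InKatakana}+")

;; "mañana" written as n followed by a combining tilde.
(def decomposed "man\u0303ana")

(re-seq #"\p{L}+" decomposed)          ; => ("man" "ana")
(re-seq letters-and-marks decomposed)  ; one match: the whole word
(re-seq katakana-run "日本語の文章にはスペースが必要ないって、本当?")
;; => ("スペース")
```

Compared with a hand-written range of katakana characters, the block form does not depend on remembering where the range starts and ends. Finding word breaks in unspaced Japanese still needs a dictionary or a tokenizer; a regex alone will not do that.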

    Edit: More on Unicode in Java

    A quick reference to other points of potential interest when working with Unicode.

    Fortunately, Java generally does a very good job of reading and writing text in the correct encodings for the locale and platform, but occasionally you need to override it.

    This is all Java; most of this stuff does not have a Clojure wrapper (at least not yet).

    • java.nio.charset.Charset – represents a charset like US-ASCII, ISO-8859-1, UTF-8
    • java.io.InputStreamReader – lets you specify a charset to translate from bytes to strings when reading. There is a corresponding OutputStreamWriter.
    • java.lang.String – lets you specify a charset when creating a String from an array of bytes.
    • java.lang.Character – has methods for getting the Unicode category of a character and converting between Java chars and Unicode code points.
    • java.util.regex.Pattern – specification of regexp patterns, including Unicode blocks and categories.

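    Java characters/strings are UTF-16 internally. The char type (and its wrapper Character) is 16 bits, which is not enough to represent all of Unicode. Characters in the Basic Multilingual Plane, which covers most living scripts including everyday Japanese, fit in a single char, but supplementary characters (emoji, rarer CJK ideographs, many historic scripts) need a surrogate pair of two chars to represent one symbol.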

    When text may contain characters outside the Basic Multilingual Plane, it’s often better to work with code points rather than chars. A code point is one Unicode character/symbol represented as an int. The String and Character classes have methods for converting between Java chars and Unicode code points.
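
To make the difference between chars and code points concrete, here is a small sketch using String and Character methods through interop. The example character U+20BB7 and the helper name code-points are my own choices for illustration.

```clojure
;; U+20BB7 is a CJK ideograph from the supplementary planes,
;; so Java stores it as a surrogate pair (two chars).
(def rare-kanji (String. (Character/toChars 0x20BB7)))
(def sample (str "a" rare-kanji "b"))

(defn code-points
  "Returns the Unicode code points of s as a vector of ints.
  Hypothetical helper, not part of any library."
  [^String s]
  (loop [i 0, acc []]
    (if (< i (.length s))
      (let [cp (.codePointAt s i)]
        ;; Advance by 1 or 2 chars depending on the code point's width.
        (recur (+ i (Character/charCount cp)) (conj acc cp)))
      acc)))

(count sample)                                     ; => 4 chars
(.codePointCount ^String sample 0 (count sample))  ; => 3 code points
(code-points sample)                               ; => [97 134071 98]
```

Indexing or taking substrings by char position can cut a surrogate pair in half, which is why stepping by Character/charCount is the safer loop.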

    • unicode.org – the Unicode standard and code charts.

    I’m putting this here since I occasionally need this stuff, but not often enough to actually remember the details from one time to the next. Sort of a note to my future self, and it might be useful to others starting out with international languages and encodings as well.
