I’m trying to use the RSS package with UTF-8 strings, to no avail. (I don’t want to use HXT, which works; I just want to understand where I’m going wrong.)
In GHCi, when I run “test”, I just get garbage for characters such as “é”.
If I get the string by reading a file with UTF8.readFile and pass it to parseFeedString, it works, but when I download the feed and use getResponseBody, it doesn’t.
Here is my sample code:
import Network.HTTP (simpleHTTP, getRequest, getResponseBody)
import Data.Maybe (fromJust)
import Text.Feed.Import (parseFeedString)
import Text.RSS.Syntax
import Text.Feed.Types (Feed(..))
import Prelude hiding (putStrLn)
import Data.ByteString.Char8 (putStrLn)
import Data.ByteString.UTF8 (fromString)
siteUrl = "http://radiofrance-podcast.net/podcast09/rss_11549.xml"
type Links = [(String,String,String)]
-------------------------------------------------------------------------------
-- Main function
-------------------------------------------------------------------------------
test = getLinks siteUrl >>= mapM_ (putStrLn.fromString)
-------------------------------------------------------------------------------
-- Retrieve titles
-------------------------------------------------------------------------------
getLinks:: String -> IO [String]
getLinks url = simpleHTTP (getRequest url) >>= getResponseBody >>= parseDoc
parseDoc d = do
  let RSSFeed rss = (fromJust . parseFeedString) d
      items = rssItems . rssChannel $ rss
      titles = map (fromJust . rssItemTitle) items
  return titles
Update:
Thanks to Roman’s answer, I have modified my code. Here are the modifications, for anyone who may be interested.
import Codec.Binary.UTF8.String (decodeString) -- <-- added
getLinks:: String -> IO [String]
getLinks url = simpleHTTP (getRequest url) >>= getResponseBody >>= parseDoc.decodeString -- <-- modified
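To see why the decodeString call helps, here is a minimal sketch that assumes the utf8-string package is installed. The name rawBody is made up for illustration; it stands in for what getResponseBody hands back when the server sends the UTF-8 bytes for “é”.

```haskell
import Codec.Binary.UTF8.String (decodeString, encodeString)

-- Hypothetical stand-in for a response body: the UTF-8 encoding of "é"
-- (bytes 0xC3 0xA9), stored as one Char per byte.
rawBody :: String
rawBody = "\195\169"

main :: IO ()
main = do
  -- Before decoding, the String holds two Chars, one per byte.
  print (length rawBody)
  -- After decoding, it holds the single code point U+00E9.
  print (decodeString rawBody == "\233")
  -- encodeString goes the other way, back to one Char per byte.
  print (encodeString "\233" == rawBody)
```

The undecoded form is what the feed parser was receiving, which is why the titles came out as garbage.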
The fact that simpleHTTP may return String-based responses is a bit confusing. In reality they are not Unicode strings, but byte strings that contain the HTTP response as is. No automatic decoding is done.
So, you need to decode the HTTP response before passing it to feed parsing functions (e.g. using the encoding or utf8-string package).
You probably want to extract the source encoding information from the Content-Type HTTP header or from the RSS document itself.
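As a rough sketch of the header route, the following pulls the charset parameter out of a Content-Type value using only base. The names charsetOf and splitOn are hypothetical helpers written for this example, not part of any library, and the parsing is deliberately simple (it does not handle every form the HTTP grammar allows). With the HTTP package, the header value itself can, as far as I know, be looked up with findHeader HdrContentType on the response.

```haskell
import Data.Char (isSpace, toLower)
import Data.List (isPrefixOf)

-- Hypothetical helper: split a string on a separator character.
splitOn :: Char -> String -> [String]
splitOn c s = case break (== c) s of
  (chunk, [])       -> [chunk]
  (chunk, _ : rest) -> chunk : splitOn c rest

-- Hypothetical helper: extract the charset parameter (lower-cased) from a
-- Content-Type value such as "application/rss+xml; charset=UTF-8".
charsetOf :: String -> Maybe String
charsetOf header =
  case [ drop (length key) p | p <- params, key `isPrefixOf` p ] of
    (v : _) -> Just (filter (/= '"') v)   -- strip optional quotes
    []      -> Nothing
  where
    key    = "charset="
    -- Parameters follow the media type, separated by ';'.
    params = map (map toLower . trim) (drop 1 (splitOn ';' header))
    trim   = dropWhile isSpace . reverse . dropWhile isSpace . reverse

main :: IO ()
main = do
  print (charsetOf "application/rss+xml; charset=UTF-8")  -- Just "utf-8"
  print (charsetOf "text/xml")                            -- Nothing
```

When the header carries no charset, a reasonable fallback is to look at the encoding attribute in the XML declaration of the feed, and to assume UTF-8 only if neither is present.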