I’m attempting to extract URLs from a string. They aren’t standardized: some are inside href attributes, others appear on their own.
I also need them grouped by type. For example, given the following strings:
val txt1: String = """Some text! <a href="http://www.google.com/test.mp3">MP3</a>"""
val txt2: String = """Some text! <a href="http://www.google.com/test.jpg">IMG</a>"""
val txt3: String = """Some more! <a href="http://www.google.com/">Link!</a>"""
These strings are all concatenated and contain 3 URLs. I’m looking for something along the lines of:
val result: List[(String, List[String])] = List(
"mp3" -> List("http://www.google.com/test.mp3"),
"img" -> List("http://www.google.com/test.jpg"),
"url" -> List("http://www.google.com/")
)
I’ve looked into regex but have only got as far as extracting hrefs without determining their types, and this doesn’t retrieve URLs that appear on their own outside of tags:
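import scala.util.matching.Regex
val hrefRegex = new Regex("""<a.*?href="(http:.*?)".*?>.*?</a>""")
// findAllMatchIn exposes the capture groups; group(1) is the href value
val hrefs: List[String] = hrefRegex.findAllMatchIn(txt1).map(_.group(1)).toList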
Any help is much appreciated, thanks in advance 🙂
Assuming val txt = txt1 + txt2 + txt3, you can wrap the text in an XML element as a string, parse it as XML, and use the standard XML library to extract the anchors. Then you just need to post-process until you have the data organized the way you want:
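A minimal sketch of that approach, under a few assumptions: the scala.xml package is on the classpath (from Scala 2.11 onward it ships as the separate scala-xml module), the text is well-formed enough to parse once wrapped in a root element, and the type is decided from the file extension. The names ExtractLinks, linkType and extract, and the list of image extensions, are illustrative choices and not part of the question.

```scala
import scala.xml.XML

object ExtractLinks {
  val txt1 = """Some text! <a href="http://www.google.com/test.mp3">MP3</a>"""
  val txt2 = """Some text! <a href="http://www.google.com/test.jpg">IMG</a>"""
  val txt3 = """Some more! <a href="http://www.google.com/">Link!</a>"""
  val txt = txt1 + txt2 + txt3

  // Hypothetical classification rule: pick the type from the file extension.
  def linkType(url: String): String = {
    val lower = url.toLowerCase
    if (lower.endsWith(".mp3")) "mp3"
    else if (List(".jpg", ".jpeg", ".png", ".gif").exists(ext => lower.endsWith(ext))) "img"
    else "url"
  }

  def extract(text: String): Map[String, List[String]] = {
    // Wrap the fragment in a single root element so it parses as one document.
    val root = XML.loadString("<root>" + text + "</root>")
    // Collect the href attribute of every <a> descendant, skipping anchors without one.
    val hrefs = (root \\ "a").toList.map(a => (a \ "@href").text).filter(_.nonEmpty)
    // Group the URLs by their detected type.
    hrefs.groupBy(linkType)
  }

  def main(args: Array[String]): Unit =
    println(extract(txt))
}
```

The result is a Map[String, List[String]]; call .toList on it if you need the list-of-pairs shape from the question. This only covers URLs inside anchors, and the XML parser will reject text that is not well-formed (an unescaped & for instance), so bare URLs and messy HTML would need a different approach.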