I’m a beginner in XQuery and I hope you can help me with a simple explanation. I’m using BaseX 7.0.1.
I have a dictionary.xml file that looks like this:
<doc>
<entry>
<vedette>je</vedette>
<variante>je</variante>
<variante>j'</variante>
<partiedudiscours>pronom</partiedudiscours>
</entry>
</doc>
I also have a file, malone_fr.xml, that contains the text I’d like to annotate. It looks like this:
<doc>
L’Opportunité
Par : Walter Malone (1866-1915)
Ils ont mal conclu ceux qui disent que je ne reviendrai plus
Quand une fois j’ai frappé à ta porte et ne t’ai pas rencontré,
</doc>
I’d like to compare the content of the <variante> elements in dictionary.xml with the words of my text, and mark up each matching word with the content of <partiedudiscours>.
So far, I’ve come up with this code:
let $comp := data(
  for $j in tokenize(for $i in db:open('malone_fr')/doc return $i, "\n")
  return tokenize($j, " ")
)
for $aa in $comp
return
  for $lemme in db:open('dictionnaire')/doc/entry
  return
    let $oldName := $aa
    return
      if ($oldName = $lemme/variante)
      then
        let $newName := element {$lemme/partiedudiscours} {$aa}
        return
          for $bb in $comp
          return
            if ($bb = $oldName)
            then $newName
            else ($bb)
      else ()
That gives me the following result:
[first iteration]
L’Opportunité Par : Walter Malone (1866-1915) Ils<verbe>ont</verbe> mal conclu ceux qui disent que je ne reviendrai plus
[second iteration]
L’Opportunité Par : Walter Malone (1866-1915) <pronom>Ils</pronom>ont mal conclu ceux qui disent que je ne reviendrai plus
As you can see, each iteration outputs the whole text with only one word annotated, whereas I need a single result with the whole text annotated, like:
L’Opportunité Par : Walter Malone (1866-1915) <pronom>Ils</pronom><verbe>ont</verbe> <adverbe>mal</adverbe> <verbe>conclu</verbe>
Etc.
I don’t know how to restructure the for loops to do that.
Thanks in advance.
I think your solution is a little more complicated than it needs to be. You should be able to do this in one loop. Using XPath to perform the lookup – instead of explicitly looping over all the values in your dictionary – will allow your database to optimize for faster retrieval of the dictionary data.
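A minimal sketch of that single-loop idea follows. It is an illustration under stated assumptions, not necessarily the exact code this answer originally showed: the two documents are inlined as stand-ins so the query runs on its own, the variable names ($dict, $text, $word, $pos) are made up for the example, and it assumes every <partiedudiscours> value is a valid element name.

```xquery
(: Inline stand-ins for the two documents from the question. In BaseX they
   could instead be db:open('dictionnaire')/doc and db:open('malone_fr')/doc. :)
let $dict :=
  <doc>
    <entry>
      <vedette>je</vedette>
      <variante>je</variante>
      <variante>j'</variante>
      <partiedudiscours>pronom</partiedudiscours>
    </entry>
  </doc>
let $text := <doc>ceux qui disent que je ne reviendrai plus</doc>

(: One pass over the tokens. The XPath predicate performs the dictionary
   lookup, so there is no explicit loop over the entries. :)
for $word in tokenize(normalize-space($text), ' ')
let $pos := $dict/entry[variante = $word][1]/partiedudiscours[1]
return
  if ($pos)
  (: Known word: wrap it in an element named after its part of speech. :)
  then (element { string($pos) } { $word }, text { ' ' })
  (: Unknown word: emit it unchanged, followed by a space. :)
  else text { concat($word, ' ') }
```

With this sample data, only "je" has a dictionary entry, so it should come back wrapped in a <pronom> element while the other words are returned as plain text. Taking the first matching entry with [1] is a simplification; a word listed under several parts of speech would need a rule for choosing among them.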
Also, the tokenize() step discards spaces, so no spaces exist in your output sequence. The output only appears spaced because that is typically the default way of rendering a sequence of atomic values; however, as you can see from your test output, spaces are not rendered around elements. In the above solution I added very basic space handling so elements are also correctly spaced. You can remove the text{" "} nodes if they are not needed.
Update: added @DennisKnochenwefel’s suggestion
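To illustrate the point about tokenize() dropping the separators, here is a small sketch. The sample string is taken from the question's text, and the variable name $words is illustrative.

```xquery
(: tokenize() returns only the pieces between the separators;
   the spaces themselves are not part of the result. :)
let $words := tokenize('Ils ont mal', ' ')
return (
  count($words),             (: 3 items: "Ils", "ont", "mal" :)
  string-join($words, ''),   (: "Ilsontmal": nothing is left between tokens :)
  string-join($words, ' ')   (: "Ils ont mal": spaces must be put back explicitly :)
)
```

This is why any spacing around constructed elements has to be added explicitly, for example with text nodes.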