I can’t seem to find a question on SO about my particular problem, so forgive me if this has been asked before!
Anyway, I’m writing a script to loop through a set of URLs and give me a list of unique URLs with unique parameters.
The trouble I’m having is actually comparing the parameters to eliminate multiple duplicates. It’s a bit hard to explain, so some examples are probably in order:
Say I have a list of URLs like this:
- hxxp://www.somesite.com/page.php?id=3&title=derp
- hxxp://www.somesite.com/page.php?id=4&title=blah
- hxxp://www.somesite.com/page.php?id=3&c=32&title=thing
- hxxp://www.somesite.com/page.php?b=33&id=3
I have it parsing each URL into a list of lists, so eventually I have a list like this:
sort = [['id', 'title'], ['id', 'c', 'title'], ['b', 'id']]
I need to figure out a way to end up with just 2 lists in my list at that point:
new = [['id', 'c', 'title'], ['b', 'id']]
As of right now I’ve got a bit of code that sorts it out a little. I know I’m close, and I’ve been banging my head against this for a couple of days now :(. Any ideas?
Thanks in advance! 🙂
EDIT: Sorry for not being clear! This script is aimed at finding unique entry points for web applications post-spidering. Basically if a URL has 3 unique entry points
['id', 'c', 'title']
I’d prefer that to the same link with 2 unique entry points, such as:
['id', 'title']
So I need my new list of lists to eliminate the one with 2 and prefer the one with 3, but ONLY if all of the variables in the smaller list are also in the larger one. If it’s still unclear let me know, and thank you for the quick responses! 🙂
I’ll assume that subsets are considered “duplicates” of their supersets (but not the other way around, of course)…
Start by converting each query into a set and ordering them all from largest to smallest. Then add each query to a new list if it isn’t a subset of an already-added query. Since any set is a subset of itself, this logic covers exact duplicates:
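A minimal sketch of that approach, assuming the queries are already parsed into lists of parameter names as in the question. The function name `drop_subsets` and the variable names are made up for illustration, and the result comes back as frozensets, so the order of names inside each query is not preserved:

```python
def drop_subsets(param_lists):
    """Keep only the queries that are not a subset of another kept query."""
    # Sets ignore parameter order; sorting largest-first guarantees that
    # any possible superset has been seen before its subsets.
    candidates = sorted((frozenset(p) for p in param_lists),
                        key=len, reverse=True)
    kept = []
    for query in candidates:
        # "<=" is the subset test. A set is a subset of itself,
        # so exact duplicates are dropped by the same check.
        if not any(query <= bigger for bigger in kept):
            kept.append(query)
    return kept


queries = [['id', 'title'], ['id', 'c', 'title'], ['b', 'id']]
unique = drop_subsets(queries)
print([sorted(q) for q in unique])
```

With the sample data, `['id', 'title']` is dropped because it is contained in `['id', 'c', 'title']`, while `['b', 'id']` stays because no larger query contains `b`. The inner `any(...)` makes this quadratic in the number of distinct queries, which should be fine for a spidering result of modest size.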