I know it’s impossible to completely stop people from copying my data, but I have a large database and I want to at least stop automated scripts from scraping all of it.
My ideas so far:
- render the content with JavaScript or encode the HTML = heavy, and easy to decode
- reCAPTCHA on the search = no way, users will just leave my website
- insert random data and tags into the site HTML to defeat regex-based scraping = good?
Any ideas are appreciated.
I think Alexa inserts random tags into its markup, and it gave me a heck of a time when I tried to mine it. They put some extra tags in the Alexa rankings, like
<span class="a5r">35</span><span class="et4">52</span><span class="arer">16</span>
and unless you downloaded the style sheet too and looked at the rendering rules, you couldn’t figure out what number that was supposed to be.
But… if I had been patient enough, I could have “rendered” the numbers and then mined them; it just wasn’t worth it for me. Limiting page requests per client to what a human could plausibly make (50 per minute or so) would probably work well.
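For what it’s worth, here is a rough sketch of the request-limiting idea in Python. It is only an illustration under a few assumptions: the class name `SlidingWindowLimiter`, the `client_id` key (an IP address or session ID, say) and the in-memory storage are all made up for the example, and the 50-per-minute default is just the ballpark figure mentioned above.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical per-client limiter: at most max_requests per window."""

    def __init__(self, max_requests=50, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)  # client_id -> timestamps of allowed requests

    def allow(self, client_id):
        """Return True if this request is within the limit, False otherwise."""
        now = self.clock()
        hits = self._hits[client_id]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=50, window_seconds=60.0)
    results = [limiter.allow("203.0.113.7") for _ in range(51)]
    print(results.count(True), "allowed,", results.count(False), "rejected")
```

In a real site you would call something like `allow()` before serving each page and answer rejected requests with an error such as HTTP 429 (Too Many Requests). A scraper spread over many IP addresses would still get through, so this only raises the cost rather than blocking scraping outright.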