In a reddit URL there is a five-character alphanumeric “thing_id” part (for example, “wplf7” in “http://redd.it/wplf7”), which is a base36 encoding of a number.
wplf7 is the base36 form of the number 54941875. That is what I have found so far; what I am wondering is how 54941875 itself is generated.
I’m trying to scrape the comments of a specific section of reddit (let’s say http://www.reddit.com/r/leagueoflegends/) using R, and I’m stuck on this five-character id.
Can anyone explain this in a simple manner? Unfortunately Python is not my domain, and the 2000 lines of Python code listed on reddit’s website didn’t help me much.
Thanks,
First, set a reasonably unique user agent, as reddit prefers this.
I assume you want to get the content at http://www.reddit.com/r/leagueoflegends/ . To do that, append .json to the URL.
The content returned is very rich: for example, it includes the domains, permalinks, authors and titles of the posts.
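The code that originally accompanied these steps is not shown here, so the following is only a minimal sketch of how they might look. It assumes the jsonlite package is installed (the answer does not say which JSON parser it used), the user agent string is a placeholder, and the helper names listing_url and fetch_posts are made up for illustration. Reddit may also rate-limit or redirect requests, so treat this as a starting point.

```r
# Placeholder user agent; replace it with something that identifies you.
# Base R's url() connections use this option for HTTP requests.
options(HTTPUserAgent = "R reddit id example by /u/your_username")

# Build the JSON listing URL for a subreddit by appending ".json".
listing_url <- function(subreddit) {
  paste0("http://www.reddit.com/r/", subreddit, "/.json")
}

# Download and parse the listing, returning one row per post.
# Assumes the jsonlite package for parsing (an assumption of this sketch).
fetch_posts <- function(subreddit) {
  con <- url(listing_url(subreddit))
  on.exit(close(con))
  txt <- paste(readLines(con, warn = FALSE), collapse = "")
  listing <- jsonlite::fromJSON(txt)
  posts <- listing$data$children$data
  posts[, c("id", "domain", "permalink", "author", "title", "created_utc")]
}
```

Calling fetch_posts("leagueoflegends") should then give a data frame whose id column holds the base36 ids discussed in the question.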
To investigate how these ids are generated, we can apply @Ben Bolker's base36ToInteger function to the ids we have gathered and compare the resulting numbers against the dates the posts were created. The numbers increase with creation time, which implies that reddit generates them sequentially across the site as new posts are created.
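The base36ToInteger function referred to above is not reproduced here, so the sketch below uses a stand-in called base36_to_number, written for illustration only. It returns a double rather than an integer so that longer ids do not overflow R's 32-bit integers. The helper add_id_numbers is also hypothetical and only assumes a data frame with an id column.

```r
# Stand-in for a base36 decoder: converts an id such as "wplf7" to a number.
base36_to_number <- function(id) {
  digits <- c(0:9, letters)                 # "0".."9", "a".."z"
  chars <- strsplit(tolower(id), "")[[1]]   # split the id into characters
  values <- match(chars, digits) - 1        # digit values 0..35
  powers <- rev(seq_along(chars)) - 1       # highest power of 36 first
  sum(values * 36^powers)
}

# Add a numeric column decoded from the base36 ids of a data frame of posts.
add_id_numbers <- function(posts) {
  posts$number <- vapply(as.character(posts$id), base36_to_number,
                         numeric(1), USE.NAMES = FALSE)
  posts
}

base36_to_number("wplf7")  # 54941875
```

With a data frame of posts that also has a created_utc column, plotting the number column against created_utc is one way to check that the ids grow with creation time.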
Without a more specific direction I will leave it at this, but hopefully you get the idea.