So I have a table with two columns “title” and “url”. The rows go as such:
Title url
Galago - Wikipedia http://en.wikipedia.org/wiki/Galago
Characteristics - Wikipedia http://en.wikipedia.org/wiki/Galago
Classification - Wikipedia http://en.wikipedia.org/wiki/Galago
Myst- Gamestop http://www.gamestop.com/ds/games/myst/69424
Plot- Gamestop http://www.gamestop.com/ds/games/myst/69424
my question is, how would I remove the common characters that are present in all rows from a certain url (remove – Wikipedia from the first three, and – Gamestop from the other 2). This is just a minor example….I have many other rows that have the same pattern (they have common characters, words, that reoccur in all of the rows from a certain url). I wanted to add that I store these values from a javacript array
I think that most automated solutions to this risk removing data that you want to keep. A word or phrase that occurs on more than one row is not necessarily redundant. A couple of potential, but still unreliable, methods come to mind. These would work only if you are looking for whole words.
Read all the titles into an array, and create a wordlist array by splitting each title into words. You can then determine the frequency of each word, and use that information to remove the unwanted words from the titles. If you have a lot of data, this method could use a lot of memory…
Parse each URL, extract the hostname, split it using a period (.) As the delimiter, and then search for and remove occurrences of those strings from the title. You might choose to create a whitelist of strings to ignore, like www, com, co, uk, net, org, and so on. This method may work if the unwanted words are found in the domain name (as in your examples).