I never really dealt with NLP but had an idea about NER which should

Question

Asked: June 4, 2026 (2026-06-04T00:07:44+00:00)

I have never really worked with NLP, but I had an idea about NER which should NOT have worked and somehow DOES work exceptionally well in one case. I do not understand why it works there, why it fails elsewhere, or whether it can be extended.

The idea was to extract names of the main characters in a story through:

Building a dictionary with an entry for each word
Filling, for each word, a list with the words that appear right next to it in the text (in the attached code: directly after it)
Finding, for each word, the word whose list overlaps most with its own (meaning that the two words are used similarly in the text)
Given the name of one character in the story, the words that are used like it should be names as well (bogus: this is what should not work, but since I had never dealt with NLP until this morning, I started the day naive)
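
As a toy illustration of the steps above (a minimal sketch, not the attached script: the sample text and the helper names `followers` and `most_similar` are made up for this example, and the tokenization is simplified):

```python
import re
from collections import defaultdict


def followers(text):
    """Map each word to the set of words that appear directly after it."""
    words = re.findall(r"\w+", text)
    table = defaultdict(set)
    for prev, nxt in zip(words, words[1:]):
        table[prev].add(nxt)
    return table


def most_similar(table, word):
    """Return the other word sharing the most followers with `word`, plus the count."""
    target = table.get(word, set())
    best, best_overlap = None, 0
    for other, ctx in table.items():
        if other == word:
            continue
        overlap = len(target & ctx)
        if overlap > best_overlap:
            best, best_overlap = other, overlap
    return best, best_overlap


# Made-up text: "Alice" and "Rabbit" are both followed by "said" and "looked".
text = "Alice said hello. Alice looked up. Rabbit said nothing. Rabbit looked away."
table = followers(text)
print(most_similar(table, "Alice"))  # ('Rabbit', 2)
```

In this toy text the two names end up paired only because they are followed by the same verbs, which is the effect the idea relies on.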

I ran the overly simple code (attached below) on Alice in Wonderland, which for “Alice” returns:

21 [‘Mouse’, ‘Latitude’, ‘William’, ‘Rabbit’, ‘Dodo’, ‘Gryphon’, ‘Crab’, ‘Queen’, ‘Duchess’, ‘Footman’, ‘Panther’, ‘Caterpillar’, ‘Hearts’, ‘King’, ‘Bill’, ‘Pigeon’, ‘Cat’, ‘Hatter’, ‘Hare’, ‘Turtle’, ‘Dormouse’]

Though it filters for capitalized words (and receives “Alice” as the word to cluster around), there are originally ~500 capitalized words, and the result is still pretty spot on as far as main characters go.

It does not work as well for other characters or for other stories, though it gives interesting results.

Any idea whether this approach is usable or extendable, or why it works at all in this story for “Alice”?

Thanks!

# English name recognition (updated to run on Python 3)
import re
import sys


def mimic_dict(filename):
  """Map each word to the list of words that directly follow it in the text."""
  followers = {}
  with open(filename) as f:
    text = f.read()
  prev = ""  # the first word of the text is stored under the empty key
  for token in text.split():
    # Keep only the first run of word characters in each whitespace-separated token
    m = re.search(r"\w+", token)
    if m is None:
      continue
    word = m.group()
    followers.setdefault(prev, []).append(word)
    prev = word
  return followers


def main():
  if len(sys.argv) != 2:
    print('usage: ./main.py file-to-read')
    sys.exit(1)

  followers = mimic_dict(sys.argv[1])

  # All capitalized words longer than one letter
  upper = [w for w in followers if len(w) > 1 and w[0].isupper()]
  print(len(upper), upper)

  # Hand-picked words to ignore
  exclude = ["ME","Yes","English","Which","When","WOULD","ONE","THAT","That","Here","and","And","it","It","me"]
  for s in exclude:
    followers.pop(s, None)

  # For each word, find the word that shares the most distinct followers with it
  follower_sets = {w: set(fs) for w, fs in followers.items()}
  scores = {}
  for key1, set1 in follower_sets.items():
    best = 0
    for key2, set2 in follower_sets.items():
      if key1 == key2:
        continue
      shared = len(set1 & set2)
      if shared > best:
        best = shared
        scores[key1] = (key2, best)

  # Keep the capitalized words whose best match is "Alice"
  names = [w for w, (match, _) in scores.items()
           if match == "Alice" and w[:1].isupper()]
  print(len(names), names)


if __name__ == '__main__':
  main()

1 Answer

Editorial Team · Answer 1 · 2026-06-04T00:07:45+00:00

From the looks of your program and previous experience with NER, I’d say this “works” because you’re not doing a proper evaluation. You’ve found “Hare” where you should have found “March Hare”.
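
To make "proper evaluation" concrete, here is a minimal sketch of exact-match scoring. The gold list is a small hand-written subset chosen for illustration, not a real annotation of the book, and `exact_match_scores` is a hypothetical helper:

```python
def exact_match_scores(predicted, gold):
    """Precision, recall and F1 where only exact full-name matches count."""
    predicted, gold = set(predicted), set(gold)
    hits = predicted & gold
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# A few of the words returned by the script in the question.
predicted = ["Hare", "Turtle", "Gryphon", "Latitude", "Dormouse"]
# Hand-written gold names with their full extent (an assumption for this example).
gold = ["March Hare", "Mock Turtle", "Gryphon", "Dormouse", "Cheshire Cat"]

precision, recall, f1 = exact_match_scores(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Under exact matching, "Hare" and "Turtle" earn no credit for "March Hare" and "Mock Turtle", so a list that looks spot on by eye can score much lower.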

The difficulty in NER (at least for English) is not finding the names. It is detecting their full extent (the “March Hare” example), detecting them even at the start of a sentence, where every word is capitalized whether or not it is a name, and classifying them as person/organisation/location/etc.
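
A rough sketch of how far capitalization alone gets you on the extent and sentence-start problems. The sentences are made up, and `capitalized_runs` is a hypothetical helper for illustration, not an actual NER method:

```python
import re


def capitalized_runs(sentence):
    """Group consecutive capitalized tokens into candidate name spans."""
    spans, current = [], []
    for token in re.findall(r"\w+", sentence):
        if token[0].isupper():
            current.append(token)
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


print(capitalized_runs("Then the March Hare spoke to Alice."))
# ['Then', 'March Hare', 'Alice']
print(capitalized_runs("Suddenly Alice ran."))
# ['Suddenly Alice']
```

Grouping runs recovers "March Hare", but the sentence-initial "Then" shows up as a false candidate and "Suddenly" gets glued onto "Alice".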

Also, Alice in Wonderland, being a children’s novel, is a rather easy text to process. Newswire phrases like “Microsoft CEO Steve Ballmer” pose a much harder problem; here, you’d want to detect

[ORG Microsoft] CEO [PER Steve Ballmer]
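
Output like this is commonly encoded as one tag per token. The sketch below assumes the widely used BIO scheme (B- begins an entity, I- continues it, O is outside any entity); the tags are written by hand rather than produced by a model, and `to_brackets` is a hypothetical helper:

```python
def to_brackets(tokens, tags):
    """Render BIO-tagged tokens in bracketed form, e.g. [PER Steve Ballmer]."""
    out, current, label = [], [], None

    def flush():
        # Emit the entity collected so far, if any.
        if current:
            out.append(f"[{label} {' '.join(current)}]")
            current.clear()

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            label = tag[2:]
            current.append(token)
        elif tag.startswith("I-") and current:
            current.append(token)
        else:
            flush()
            out.append(token)
    flush()
    return " ".join(out)


tokens = ["Microsoft", "CEO", "Steve", "Ballmer"]
tags = ["B-ORG", "O", "B-PER", "I-PER"]  # hand-assigned for illustration
print(to_brackets(tokens, tags))  # [ORG Microsoft] CEO [PER Steve Ballmer]
```

A system has to get both the span boundaries and the labels right to produce these tags, which is more than a list of capitalized words provides.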
