I am extracting information about certain companies from Reuters using Python. I have been able to get the officer/executive names, biographies, and compensation from this page.
Now, I want to extract previous position titles and companies from the biography section, which looks something like this:
Mr. Donald T. Grimes is Senior Vice President, Chief Financial Officer and Treasurer of Wolverine World Wide, Inc., since May 2008. From 2007 to 2008, he was the Executive Vice President and Chief Financial Officer for Keystone Automotive Operations, Inc., a distributor of automotive accessories and equipment. Prior to Keystone, Mr. Grimes held a series of senior corporate and divisional finance roles at Brown-Forman Corporation, a manufacturer and marketer of premium wines and spirits. During his employment at Brown-Forman, Mr. Grimes was Vice President, Director of Beverage Finance from 2006 to 2007; Vice President, Director of Corporate Planning and Analysis from 2003 to 2006; and Senior Vice President, Chief Financial Officer of Brown-Forman Spirits America from 1999 to 2003.
I can use simple regex to get the from and to years, but I am at a loss on how to write regex to get the titles and the company name as well. I know the string format is inconsistent, so I would take an answer that works for at least 70% of cases. Here’s the output I would like:
2007-2008, executive vice president and chief financial officer, Keystone Automotive operations
The problem you are trying to solve is well known and well researched. You will find many research papers describing approaches and algorithms if you search for the terms “Named Entity Extraction” and “Relationship Extraction”. Some good starting points are:
Chapter 7 of the book “Natural Language Processing with Python”; in fact, that entire book would probably be helpful. The chapter is available online here.
This paper on “Named Entity Relation Mining using Wikipedia”
This paper, “Novel Algorithms for Relationship Mining”, which describes mining job titles and organizations as one of its examples.
These are just a few links I’ve found interesting. There are many more, and probably better ones than these, but this should get you started.
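If you want a quick baseline before trying any of the above, a pair of regular expressions already covers the two phrasings in your sample biography. This is only a rough sketch, not a general solution: the names `BIO`, `LEADING`, `TRAILING`, and `extract_positions` are my own, and I am assuming biographies use either "From YYYY to YYYY, he/she was the <title> for <company>," or "<title> from YYYY to YYYY" in a semicolon-separated list. Anything phrased differently will be missed, which is where the entity extraction techniques come in.

```python
import re

# Excerpt of the sample biography from the question (two of its sentences).
BIO = (
    "From 2007 to 2008, he was the Executive Vice President and Chief "
    "Financial Officer for Keystone Automotive Operations, Inc., a "
    "distributor of automotive accessories and equipment. "
    "During his employment at Brown-Forman, Mr. Grimes was Vice President, "
    "Director of Beverage Finance from 2006 to 2007; Vice President, "
    "Director of Corporate Planning and Analysis from 2003 to 2006; and "
    "Senior Vice President, Chief Financial Officer of Brown-Forman Spirits "
    "America from 1999 to 2003."
)

# Assumed phrasing A: "From 2007 to 2008, he was the <title> for <company>, ..."
# The company is taken as the capitalised run up to the next comma,
# semicolon, or period, so suffixes such as ", Inc." are dropped.
LEADING = re.compile(
    r"From (?P<start>\d{4}) to (?P<end>\d{4}), (?:he|she) was (?:the |an? )?"
    r"(?P<title>.+?) (?:for|of|at) (?P<company>[A-Z][^,;.]*)"
)

# Assumed phrasing B: "... was <title> from 2006 to 2007; <title> from ...; and ..."
# No attempt is made to split a company out of the title here.
TRAILING = re.compile(
    r"(?:was |; (?:and )?)(?P<title>[A-Z][^;]*?)"
    r" from (?P<start>\d{4}) to (?P<end>\d{4})"
)


def extract_positions(bio):
    """Return (years, title, company) tuples; company is None when unknown."""
    rows = []
    for m in LEADING.finditer(bio):
        rows.append((f"{m['start']}-{m['end']}", m["title"], m["company"]))
    for m in TRAILING.finditer(bio):
        rows.append((f"{m['start']}-{m['end']}", m["title"], None))
    return rows


for row in extract_positions(BIO):
    print(row)
```

On this excerpt the first tuple is the 2007-2008 Keystone role you asked for, followed by the three Brown-Forman roles with the company left as None. Note that the last of those still has "of Brown-Forman Spirits America" inside the title; deciding where a title ends and an organization begins is the part that a proper named entity recognizer handles better than regex.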