I’m trying to extract data from a few large textfiles containing entries about people.

Question


Asked: May 31, 2026


I’m trying to extract data from a few large textfiles containing entries about people. The problem is, though, I cannot control the way the data comes to me.

It is usually in a format like this:

LASTNAME, Firstname Middlename (Maybe a Nickname)Why is this text hereJanuary, 25, 2012

Firstname Lastname 2001 Some text that I don’t care about

Lastname, Firstname blah blah … January 25, 2012 …

Currently, I am using a huge regex that splits all kinda-CamelCase words (run-together words such as “hereJanuary”), all words that have a month name tacked onto the end, and a long list of special cases for names. Then I use more regexes to extract many combinations of name and date.

This seems sub-optimal.
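For reference, the splitting step can be approximated as below. This is a minimal sketch, not the actual regex in use: the names `split_glued`, `GLUED_MONTH` and `GLUED_BOUNDARY` are made up for illustration, and it assumes English month names spelled out in full.

```python
import re

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")

# A month name glued onto the end of the preceding word, e.g. "hereJanuary".
# The trailing \b keeps words such as "Maybe" or "Marching" intact.
GLUED_MONTH = re.compile(r"(?<=[A-Za-z])(?=(?:%s)\b)" % "|".join(MONTHS))

# A lowercase letter or ")" directly followed by an uppercase letter
# is treated as a missing space.
GLUED_BOUNDARY = re.compile(r"(?<=[a-z)])(?=[A-Z])")


def split_glued(text):
    """Insert a space at each boundary where two tokens were run together."""
    text = GLUED_MONTH.sub(" ", text)
    return GLUED_BOUNDARY.sub(" ", text)
```

One known weakness of this kind of rule: legitimate mixed-case names such as “McDonald” get split as well, which is where the special cases for names come from.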

Are there any machine-learning libraries for Python that can parse malformed data that is somewhat structured?

I’ve tried NLTK, but it could not handle my dirty data. I’m tinkering with Orange right now and I like its OOP style, but I’m not sure if I’m wasting my time.

Ideally, I’d like to do something like this to train a parser (with many input/output pairs):

training_data = (
  'LASTNAME, Firstname Middlename (Maybe a Nickname)FooBarJanuary 25, 2012',
   ['LASTNAME', 'Firstname', 'Middlename', 'Maybe a Nickname', 'January 25, 2012']
)

Is something like this possible or am I overestimating machine learning? Any suggestions will be appreciated, as I’d like to learn more about this topic.


1 Answer

Editorial Team · Answer 1 · 2026-05-31T11:44:39+00:00

I ended up implementing a somewhat-complicated series of exhaustive regexes that encompassed every possible use case using text-based “filters” that were substituted with the appropriate regexes when the parser loaded.

If anyone’s interested in the code, I’ll edit it into this answer.

Here’s basically what I used. To construct the regular expressions out of my “language”, I had to make replacement classes:

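import re


class Replacer(object):
    """Callable for re.sub that turns a $placeholder into a regex group."""

    def __call__(self, match):
        group = match.group(0)

        # Every Matcher regex starts with "(", so [1:] drops that paren and
        # a new group opener is put in front of the rest.
        if group[1:].lower().endswith('_nm'):
            # "_nm" placeholders become non-capturing groups.
            return '(?:' + Matcher(group).regex[1:]
        else:
            # Everything else becomes a group named after the placeholder.
            return '(?P<' + group[1:] + '>' + Matcher(group).regex[1:]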

Then, I made a generic Matcher class, which constructed a regex for a particular pattern given the pattern name:

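# Month names used by Matcher.date. The original lists were not shown, so
# these definitions are an assumption.
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']
months_short = [m[:3] for m in months]


class Matcher(object):
    # Inside a character class "|" is a literal pipe, so it is left out here.
    name_component = r"([A-Z][A-Za-z'\-]+|[A-Z][a-z]{2,})"
    name_component_upper = r"([A-Z][A-Z'\-]+|[A-Z]{2,})"

    year = r'(1[89][0-9]{2}|20[0-9]{2})'
    year_upper = year

    # Three-digit alternative first, so that "105" is not cut short to "10".
    age = r'(1[01][0-9]|[1-9][0-9])'
    age_upper = age

    ordinal = r'([1-9][0-9]|1[01][0-9])\s*(?:th|rd|nd|st|TH|RD|ND|ST)'
    ordinal_upper = ordinal

    # Accepts "January 25, 2012", "25 January 2012" and "1/25/2012" style dates.
    date = r'((?:{0})\.? [0-9]{{1,2}}(?:th|rd|nd|st|TH|RD|ND|ST)?,? \d{{2,4}}|[0-9]{{1,2}} (?:{0}),? \d{{2,4}}|[0-9]{{1,2}}[\-/\.][0-9]{{1,2}}[\-/\.][0-9]{{2,4}})'.format('|'.join(months + months_short) + '|' + '|'.join(months + months_short).upper())
    date_upper = date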

    matchers = [
        'name_component',
        'year',
        'age',
        'ordinal',
        'date',
    ]

    def __init__(self, match=''):
        capitalized = '_upper' if match.isupper() else ''
        match = match.lower()[1:]

        if match.endswith('_instant'):
            match = match[:-8]

        if match in self.matchers:
            self.regex = getattr(self, match + capitalized)
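        elif len(match) == 1:
            # The body of this branch is missing in the original. Matching a
            # single capital letter (an initial, as in "$I.") is an assumption.
            self.regex = r'([A-Z])'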
        elif 'year' in match:
            self.regex = getattr(self, 'year')
        else:
            self.regex = getattr(self, 'name_component' + capitalized)

Finally, there’s the generic Pattern object:

class Pattern(object):
    def __init__(self, text='', escape=None):
        self.text = text
        self.matchers = []

        # Patterns starting with '!' are raw regexes; an explicit argument wins.
        escape = not self.text.startswith('!') if escape is None else escape

        if escape:
            self.regex = re.sub(r'([\[\].?+\-()\^\\])', r'\\\1', self.text)
        else:
            self.regex = self.text[1:]

        self.size = len(re.findall(r'(\$[A-Za-z0-9\-_]+)', self.regex))

        self.regex = re.sub(r'(\$[A-Za-z0-9\-_]+)', Replacer(), self.regex)
        self.regex = re.sub(r'\s+', r'\\s+', self.regex)

    def search(self, text):
        return re.search(self.regex, text)

    def findall(self, text, max_depth=1.0):
        results = []
        length = float(len(text))

        for result in re.finditer(self.regex, text):
            if result.start() / length < max_depth:
                results.extend(result.groups())

        return results

    def match(self, text):
        # Build a real list: in Python 3, map() returns a lazy iterator that
        # is always truthy, so the old emptiness check never fired.
        return [(m.groupdict(), m.start()) for m in re.finditer(self.regex, text)]
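The `max_depth` argument of `findall` is the least obvious part: it keeps only matches that start within the first fraction of the text. Here is a standalone sketch of the same idea. The function name `findall_within` is hypothetical, and unlike the method above it returns early on empty input to avoid dividing by zero.

```python
import re


def findall_within(pattern, text, max_depth=1.0):
    """Collect the groups of every match that starts within the first
    max_depth fraction of text (1.0 means the whole text)."""
    if not text:
        return []

    results = []
    for m in re.finditer(pattern, text):
        # m.start() / len(text) is the relative position of the match.
        if m.start() / len(text) < max_depth:
            results.extend(m.groups())
    return results
```

This is useful when the interesting fields (name, date) sit near the start of an entry and later matches are mostly noise.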

It got pretty complicated, but it worked. I’m not going to post all of the source code, but this should get someone started. In the end, it converted a file like this:

$LASTNAME, $FirstName $I. said on $date

into a compiled regex with named capturing groups.
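To make the conversion concrete, here is a much smaller, self-contained sketch of the same idea. It is not the code from this answer: `template_to_regex`, `PLACEHOLDERS`, `NAME` and `INITIAL` are hypothetical names, the placeholder table is deliberately tiny, and the date pattern only covers the “January 25, 2012” form.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# Hypothetical placeholder table; anything not listed is treated as a name.
PLACEHOLDERS = {
    "date": r"(?:%s) [0-9]{1,2},? [0-9]{4}" % MONTHS,
    "year": r"(?:1[89]|20)[0-9]{2}",
}
NAME = r"[A-Z][A-Za-z'\-]+"
INITIAL = r"[A-Z]"


def _literal(text):
    # Escape literal text, but let any run of whitespace match flexibly.
    return r"\s+".join(re.escape(chunk) for chunk in re.split(r"\s+", text))


def template_to_regex(template):
    """Turn '$LASTNAME, $FirstName ...' into a regex with named groups."""
    parts = []
    pos = 0
    for m in re.finditer(r"\$([A-Za-z][A-Za-z0-9_]*)", template):
        parts.append(_literal(template[pos:m.start()]))
        name = m.group(1)
        if len(name) == 1:
            body = INITIAL  # single-letter placeholder: assumed to be an initial
        else:
            body = PLACEHOLDERS.get(name.lower(), NAME)
        parts.append("(?P<%s>%s)" % (name, body))
        pos = m.end()
    parts.append(_literal(template[pos:]))
    return re.compile("".join(parts))
```

Usage, with a made-up input line:

```python
rx = template_to_regex("$LASTNAME, $FirstName $I. said on $date")
m = rx.search("SMITH, John Q. said on January 25, 2012")
print(m.groupdict())
```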
