The Archive Base


Editorial Team
Asked: May 23, 2026

I’m trying to work out how to implement some machine learning library to help


I’m trying to work out how to implement some machine learning library to help me find out what the correct weighting for each parameter is in order to make a good decision.

In more detail:

Context: I'm trying to implement a date-of-publication extractor for HTML files. This is for news sites, so I don’t have a generic date format that I can use. I’m using the parser in dateutil in Python, which does a pretty good job. I end up with a list of possible publication dates (all the dates in the HTML file).

From a set of parameters, such as close tags, words close to the date substring, etc., I sort the list according to likelihood of being the publication date. The weighting for each parameter is currently an educated guess.

I would like to implement a machine learning algorithm that, after a training period (in which the actual publication date is provided), determines what the weighting for each parameter should be.

I’ve been reading the documentation of different machine learning libraries in Python (pyML, scikit-learn, pybrain), but I haven’t found anything useful. I’ve also read this and there’s a close example with determining if a mushroom is edible or not.

Note: I’m working in python.

I would very much appreciate your help.


1 Answer

  1. Editorial Team
    Added an answer on May 23, 2026 at 2:41 pm

    Given your problem description, the characteristics of your data, and your ML background and personal preferences, I would recommend Orange.

    Orange is a mature, free and open source project with a large selection of ML algorithms and excellent documentation and training materials. Most users probably use the GUI supplied with Orange, but the framework is scriptable with Python.

    Using this framework will therefore let you experiment quickly with a variety of classifiers because (i) they are all in one place; and (ii) each is accessed through a common configuration syntax and GUI. All of the ML techniques within the Orange framework can be run in “demo” mode on one or more sample data sets supplied with the Orange install. The documentation supplied with the Orange install is excellent. In addition, the home page includes links to numerous tutorials that cover probably every ML technique included in the framework.

    Given your problem, perhaps begin with a Decision Tree algorithm (either a C4.5 or an ID3 implementation). A fairly recent edition of Dr. Dobb’s Journal (online) includes an excellent article on using decision trees; the use case is web server data (from the server access log).

    Orange has a C4.5 implementation, available from the GUI (as a “widget”). If that’s too easy, about 100 lines is all it takes to code one in Python. Here’s the source for a working implementation in that language.
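To give a feel for how little code a basic decision tree needs, here is a minimal sketch of the simpler ID3 variant (not C4.5, and not Orange's implementation). It assumes every feature is categorical; the feature names `in_head` and `near_byline`, the labels, and the six training rows are made up purely for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting on one categorical feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def build_tree(rows, labels, features):
    """Return a nested dict {feature: {value: subtree}} or a leaf label."""
    if len(set(labels)) == 1:          # pure node: stop splitting
        return labels[0]
    if not features:                   # nothing left to split on: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    remaining = [f for f in features if f != best]
    tree = {best: {}}
    for value in sorted({row[best] for row in rows}):
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        tree[best][value] = build_tree([r for r, _ in pairs],
                                       [l for _, l in pairs], remaining)
    return tree

def classify(tree, row):
    """Follow the branches matching the row until a leaf label is reached."""
    while isinstance(tree, dict):
        feature = next(iter(tree))
        tree = tree[feature][row[feature]]
    return tree

# Hypothetical training data: one row per candidate date found in a page.
rows = [
    {"in_head": "yes", "near_byline": "no"},
    {"in_head": "yes", "near_byline": "yes"},
    {"in_head": "yes", "near_byline": "no"},
    {"in_head": "no", "near_byline": "yes"},
    {"in_head": "no", "near_byline": "no"},
    {"in_head": "no", "near_byline": "no"},
]
labels = ["pub", "pub", "pub", "pub", "other", "other"]

tree = build_tree(rows, labels, ["in_head", "near_byline"])
prediction = classify(tree, {"in_head": "no", "near_byline": "yes"})
```

This sketch has no pruning, no handling of continuous or missing values, and `classify` raises a `KeyError` for a feature value it never saw in training; those are among the things C4.5 adds on top of ID3.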

    I recommend starting with a Decision Tree for several reasons.

    1. If it works on your data, you will not only have a trained classifier, but you will also have a visual representation of the entire classification schema (represented as a binary tree). Decision Trees are (probably) unique among ML techniques in this respect.

    2. The characteristics of your data are aligned with the optimal performance scenario of C4.5: the data can be either categorical or continuous variables (though this technique performs better if more of the features (columns/fields) are discrete rather than continuous, which seems to describe your data). Also, Decision Tree algorithms can accept incomplete data points without any pre-processing.

    3. Simple data pre-processing. The data fed to a decision tree algorithm does not require as much pre-processing as most other ML techniques; pre-processing is often (usually?) the most time-consuming task in the entire ML workflow. It’s also sparsely documented, so it’s probably also the most likely source of error.

    4. You can deduce the (relative) weight of each variable from each node’s distance from the root, in other words from a quick visual inspection of the trained classifier. Recall that the trained classifier is just a binary tree (and is often rendered this way) in which each node corresponds to one value of one feature (variable, or column in your data set); the two edges joined to that node represent the data points split into two groups based on each point’s value for that feature (e.g., if the feature is the categorical variable “Publication Date in HTML Page Head?”, then through the left edge will flow all data points in which the publication date is not within the opening and closing head tags, and the right edge gets the other group). What is the significance of this? Since a node just represents a state or value for a particular variable, that variable’s importance (or weight) in classifying the data can be deduced from its position in the tree, i.e., the closer it is to the root node, the more important it is.
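The "closer to the root means more important" reading in point 4 can also be done programmatically. The sketch below assumes the trained tree is available as a nested dict of the form `{feature: {value: subtree}}`; that representation, the hand-written tree, and the feature names are all hypothetical, and a real library will expose its tree differently.

```python
def feature_depths(tree, depth=0, found=None):
    """Map each feature to the shallowest depth at which it splits the data."""
    if found is None:
        found = {}
    if not isinstance(tree, dict):     # leaf: a class label, nothing to record
        return found
    for feature, branches in tree.items():
        found[feature] = min(depth, found.get(feature, depth))
        for subtree in branches.values():
            feature_depths(subtree, depth + 1, found)
    return found

# Hypothetical trained tree, written by hand for illustration only.
trained = {
    "in_head": {
        "yes": "publication",
        "no": {
            "near_byline": {
                "yes": "publication",
                "no": {"inside_footer": {"yes": "other", "no": "publication"}},
            }
        },
    }
}

depths = feature_depths(trained)
# Features sorted from most important (root) to least important (deepest).
ranking = sorted(depths, key=depths.get)
```

Depth gives only an ordering of the features, not a numeric weight, so treat `ranking` as a rough guide to which parameters matter most.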

    From your question, it seems you have two tasks to complete before you can feed your training data to an ML classifier.

    I. identify plausible class labels

    What you want to predict is a date. Unless your resolution requirements are unusually strict (e.g., resolved to a single date) I would build a classification model (which returns a class label given a data point) rather than a regression model (which returns a single continuous value).

    Given that your response variable is a date, a straightforward approach is to set the earliest date to the baseline, 0, then represent all other dates as an integer value that represents the distance from that baseline. Next, discretize all dates into a small number of ranges. One very simple technique for doing this is to calculate the five summary descriptive statistics for your response variable (min, 1st quartile, median, 3rd quartile, and max). From these five statistics, you get four sensibly chosen date ranges (though probably not of equal span; they will be of roughly equal membership size, since each quartile range holds about a quarter of the data).

    These four ranges of date values then represent your class labels. So, for instance, classI might be all data points (web pages, I suppose) whose response variable (publication date) is 0 to 10 days after 0; classII is 11 days after 0 to 25 days after 0, etc.
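As a rough sketch of that discretization step, the quartiles can be computed with the standard library's `statistics.quantiles` (Python 3.8+) and used as cut points. The day offsets, the `quartile_labeler` helper, and the class names beyond classI and classII are illustrative assumptions, not part of any library.

```python
import bisect
import statistics

def quartile_labeler(day_offsets):
    """Return the three quartile cut points and a function mapping days to a class."""
    q1, median, q3 = statistics.quantiles(day_offsets, n=4, method="inclusive")
    cuts = [q1, median, q3]
    names = ["classI", "classII", "classIII", "classIV"]

    def label(days):
        # bisect_left counts how many cut points lie strictly below `days`
        return names[bisect.bisect_left(cuts, days)]

    return cuts, label

# Hypothetical publication dates, already expressed as days after the baseline.
offsets = [0, 3, 10, 11, 25, 40, 90, 475]
cuts, label = quartile_labeler(offsets)
labels = [label(d) for d in offsets]
```

With these eight made-up offsets each class ends up with two members; on real data the spans in days will differ widely from one class to the next.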

    [Note: added the code below in light of the OP’s comment below this answer, requesting clarification.]

    >>> from datetime import datetime
    # suppose these are publication dates (day-month-year strings)
    >>> pd0 = "04-09-2011"
    >>> pd1 = "17-05-2010"
    # convert them to Python datetime instances
    >>> pd0 = datetime.strptime(pd0, "%d-%m-%Y")
    >>> pd1 = datetime.strptime(pd1, "%d-%m-%Y")
    # gather them (and any further dates) in a Python list, then sort that list
    >>> pd_all = [pd0, pd1]
    >>> pd_all.sort()
    # 'sort' performs an in-place sort on the list of datetime objects,
    # so the first item in the list is now the earliest publication date
    >>> pd_all[0]
    datetime.datetime(2010, 5, 17, 0, 0)
    # express every date as its distance in days from that earliest date
    >>> td0 = pd_all[1] - pd_all[0]           # td0 is a timedelta object
    >>> td0
    datetime.timedelta(days=475)
    # convert the time deltas to integers via the timedelta's .days attribute;
    # d is just a Python list of integers: days from the common baseline date
    >>> d = [(p - pd_all[0]).days for p in pd_all]
    >>> d
    [0, 475]
    

    II. convert your raw data to an “ML-useable” form.

    For a C4.5 classifier, this task is far simpler and requires fewer steps than for probably every other ML algorithm. What’s preferred here is to discretize as many of your parameters as possible to a relatively small number of values. For example, if one of your parameters/variables is “distance of the publication date string from the closing body tag”, then I would suggest discretizing those values into ranges, just as marketing surveys often ask participants to report their age in one of a specified set of spans (18 – 35; 36 – 50, etc.) rather than as a single integer (41).
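A minimal sketch of that kind of binning, using fixed spans chosen by hand rather than learned from the data. The unit (characters before the closing body tag), the boundaries, and the names `UPPER_BOUNDS`, `SPAN_NAMES` and `to_span` are all assumptions made for the example.

```python
import bisect

# Hypothetical upper bounds (inclusive) of each span, in characters.
UPPER_BOUNDS = [100, 500, 2000]
SPAN_NAMES = ["0-100", "101-500", "501-2000", "2001+"]

def to_span(distance_chars):
    """Map a raw distance to the name of the span that contains it."""
    return SPAN_NAMES[bisect.bisect_left(UPPER_BOUNDS, distance_chars)]

# Raw distances for five hypothetical candidate dates.
distances = [12, 100, 101, 1999, 35000]
spans = [to_span(d) for d in distances]
```

The resulting strings can be fed to a tree learner as an ordinary categorical feature in place of the raw integer.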

© 2021 The Archive Base. All Rights Reserved