The Archive Base

Editorial Team
Asked: May 13, 2026

As a student of computational linguistics, I frequently do machine learning experiments where I

As a student of computational linguistics, I frequently do machine learning experiments where I have to prepare training data from all kinds of different resources, like raw or annotated text corpora or syntactic treebanks. For every new task and every new experiment I write programs (normally in Python and sometimes Java) to extract the features and values I need and to transform the data from one format to another. This usually results in a very large number of very large files and a very large number of small programs which process them in order to get the input for some machine learning framework (like the ARFF files for Weka).

One needs to be extremely well organised to deal with that, and program with great care so as not to miss any important peculiarities, exceptions or errors in the tons of data. Many principles of good software design, like design patterns or refactoring paradigms, are of little use for these tasks because things like security, maintainability or sustainability are of no real importance: once the program has successfully processed the data, one doesn’t need it any longer. This has gone so far that I have even stopped bothering to use classes or functions at all in my Python code and program in a simple procedural way. The next experiment will require different data sets with unique characteristics and in a different format, so their preparation will likely have to be programmed from scratch anyway. My experience so far is that it’s not unusual to spend 80-90% of a project’s time on preparing training data. Hours and days go by just thinking about how to get from one data format to another. At times, this can become quite frustrating.

Well, you probably guessed that I’m exaggerating a bit, on purpose even, but I’m positive you understand what I’m trying to say. My question, actually, is this:

Are there any general frameworks, architectures or best practices for approaching these tasks? How much of the code I write can I expect to be reusable, given optimal design?

1 Answer

  1. Editorial Team
    Added an answer on May 13, 2026 at 10:41 am

    I find myself mostly using the textutils from GNU coreutils and flex for corpus preparation, chaining things together in simple scripts, at least when the preparations I need to make are simple enough for regular expressions, trivial filtering and the like.
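Since the question mentions Python, the same "chain small tools together" idea can be sketched there with generators. This is only an illustration under my own assumptions: the names `grep`, `substitute` and `pipeline` are hypothetical helpers made up for this sketch, not part of any library, and the input is assumed to be an iterable of text lines.

```python
import re
from typing import Callable, Iterable, Iterator

# A stage takes an iterable of lines and returns an iterator of lines,
# much like one command in a shell pipeline.
Stage = Callable[[Iterable[str]], Iterator[str]]


def grep(pattern: str) -> Stage:
    """Hypothetical helper: keep only lines matching the regex."""
    compiled = re.compile(pattern)

    def stage(lines: Iterable[str]) -> Iterator[str]:
        return (line for line in lines if compiled.search(line))

    return stage


def substitute(pattern: str, replacement: str) -> Stage:
    """Hypothetical helper: apply a regex substitution to every line."""
    compiled = re.compile(pattern)

    def stage(lines: Iterable[str]) -> Iterator[str]:
        return (compiled.sub(replacement, line) for line in lines)

    return stage


def pipeline(*stages: Stage) -> Stage:
    """Compose stages left to right, like `a | b | c` in a shell."""

    def run(lines: Iterable[str]) -> Iterator[str]:
        for stage in stages:
            lines = stage(lines)
        return iter(lines)

    return run


# Drop blank lines, collapse runs of whitespace, trim the edges.
clean = pipeline(
    grep(r"\S"),
    substitute(r"\s+", " "),
    substitute(r"^ | $", ""),
)
result = list(clean(["  The  cat ", "", "sat   down"]))
print(result)
```

Because every stage is lazy, large corpus files can be streamed line by line (for example by passing an open file object to `clean`) without loading them into memory.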

    It is still possible to make things reusable; the general rules apply here too. If you program purely procedurally, with no regard for best practices and the like, then IMHO it is really no wonder that you have to do everything from scratch when starting a new project.

    Even though the format requirements will vary a lot, there are still many common tasks, e.g. tag-stripping, tag-translation, selection, tabulation, and some trivial data harvesting such as counting the number of tokens, sentences and the like. Programming these tasks aiming for high reusability will pay off, even though it takes longer at first.
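A few of the common tasks listed above could be written once as small functions and reused across experiments. The following is a minimal sketch, not a definitive design. It assumes a simple "word/TAG" format with one sentence per line, which is my assumption and not something the question specifies; all function names and the example tag mapping are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, tag)


def parse_tagged(line: str, sep: str = "/") -> List[Token]:
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs.

    Assumed input format; items without a separator get an empty tag.
    """
    tokens = []
    for item in line.split():
        word, found, tag = item.rpartition(sep)
        tokens.append((word, tag) if found else (item, ""))
    return tokens


def strip_tags(tokens: List[Token]) -> List[str]:
    """Tag-stripping: keep only the words."""
    return [word for word, _ in tokens]


def translate_tags(tokens: List[Token], mapping: Dict[str, str]) -> List[Token]:
    """Tag-translation: map tags to another tag set, leaving unknown tags as-is."""
    return [(word, mapping.get(tag, tag)) for word, tag in tokens]


def tabulate(tokens: List[Token]) -> List[str]:
    """Tabulation: one tab-separated 'word<TAB>tag' row per token."""
    return [f"{word}\t{tag}" for word, tag in tokens]


def corpus_stats(sentences: List[List[Token]]) -> Dict[str, object]:
    """Trivial data harvesting: sentence count, token count, tag frequencies."""
    tag_counts = Counter(tag for sent in sentences for _, tag in sent)
    return {
        "sentences": len(sentences),
        "tokens": sum(len(sent) for sent in sentences),
        "tags": tag_counts,
    }


sentences = [
    parse_tagged("The/DT cat/NN sat/VBD"),
    parse_tagged("Dogs/NNS bark/VBP"),
]
words = strip_tags(sentences[0])
coarse = translate_tags(sentences[0], {"NN": "NOUN", "VBD": "VERB"})
stats = corpus_stats(sentences)
```

Only `parse_tagged` knows about the input format, so a new corpus format would mean writing one new parser while the other functions stay unchanged.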
