The Archive Base

Editorial Team
Asked: June 3, 2026

I’m trying to do document classification using the Weka Java API.

Here is my directory structure of the data files.

+- text_example
|
+- class1
|  |
|  3 html files
|
+- class2
|   |
|   1 html file
|
+- class3
    |
    3 html files

I created the ARFF file with TextDirectoryLoader, then applied the StringToWordVector filter to it, with filter.setOutputWordCounts(true).

Below is a sample of the output once the filter is applied. I need to get a few things clarified.

@attribute </form> numeric
@attribute </h1> numeric
.
.
@attribute earth numeric
@attribute easy numeric

This huge list should be the tokenization of the content of the initial HTML files, right?

Then I have,

@data
{1 2,3 2,4 1,11 1,12 7,..............}
{10 4,34 1,37 5,.......}
{2 1,5 6,6 16,...}
{0 class2,34 11,40 15,.....,4900 3,...
{0 class3,1 2,37 3,40 5....
{0 class3,1 2,31 20,32 17......
{0 class3,32 5,42 1,43 10.........

Why is there no class attribute for the first 3 items? (They should have class1.)
What does the leading 0 mean, as in {0 class2,..}, {0 class3..}?
It says, for instance, that in the 3rd HTML file in the class3 folder, the word identified by the integer 32 appears 5 times. How do I get the word (token) referred to by 32?

How do I reduce the dimensionality of the feature vector? Don’t we need to make all the feature vectors the same size? For example, we could consider only the 100 most frequent terms from the training set and later, when testing, count the occurrences of only those 100 terms in the test documents. In that case, what happens if a totally new word comes up in the testing phase? Will the classifier just ignore it?

Am I missing something here? I’m new to Weka.

Also, I would really appreciate it if someone could explain how the classifier uses the vector created with the StringToWordVector filter. (Creating the vocabulary from the training data, dimensionality reduction: do those happen inside the Weka code?)

1 Answer

  1. Editorial Team
    Added an answer on June 3, 2026 at 8:14 pm
    1. The huge list of @attribute contains all the tokens derived from your input.
    2. Your @data section is in the sparse format: for each instance, an attribute's value is only stated if it differs from zero. Each entry is a pair of attribute index and value, so the leading 0 in {0 class2,...} is the index of the class attribute. For the first three lines the class is class1; you just can’t see it (if it were unknown, you would see 0 ? at the beginning of those lines). Why is that so? Weka internally represents nominal attributes (that includes classes) as doubles and starts counting at zero. So your three classes are internally: class1=0.0, class2=1.0, class3=2.0. As zero values are not stated in the sparse format, you can’t see the class in the first three lines. (Also see the section “Sparse ARFF files” on http://www.cs.waikato.ac.nz/ml/weka/arff.html)
    3. To get the word/token represented by index n, you can either count or, if you have the Instances object, invoke attribute(n).name() on it. For that, n starts counting at 0.
    4. To reduce the dimensionality of the feature vector, there are a lot of options. If you only want to have the 100 most frequent terms, call stringToWordVector.setWordsToKeep(100). Note that this will try to keep 100 words for every class. If you do not want to keep 100 words per class, call stringToWordVector.setDoNotOperateOnPerClassBasis(true). You will get slightly more than 100 if several words have the same frequency, so the 100 is just a kind of target value.
    5. As for new words occurring in the test phase, I think that cannot happen, because you have to hand the StringToWordVector all instances before classifying. I am not 100% sure about that, though, as I am using a two-class setup and let StringToWordVector transform all my instances before telling the classifier anything about them.
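
To make the sparse-format explanation in point 2 concrete, here is a minimal sketch in plain Java. It is a hand-rolled decoder written for illustration, not Weka's own ARFF parser; the class name SparseRowDemo and its methods are hypothetical, and it assumes attribute 0 is the class and that no value is quoted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: a hand-rolled decoder, not Weka's own ARFF parser.
class SparseRowDemo {

    /**
     * Expands a sparse row such as "{1 2,3 2}" into one raw value per
     * attribute. Attributes that the row does not mention default to "0".
     * Simplification: quoted values containing commas or spaces are not handled.
     */
    static List<String> expand(String sparseRow, int numAttributes) {
        String[] dense = new String[numAttributes];
        Arrays.fill(dense, "0");
        String body = sparseRow.trim();
        body = body.substring(1, body.length() - 1).trim(); // drop the braces
        if (!body.isEmpty()) {
            for (String pair : body.split(",")) {
                String[] parts = pair.trim().split("\\s+", 2); // "index value"
                dense[Integer.parseInt(parts[0])] = parts[1];
            }
        }
        return new ArrayList<>(Arrays.asList(dense));
    }

    /**
     * Resolves attribute 0 (assumed to be the class) to a label. An explicit
     * label is returned as is; the implicit "0" maps to the first declared label.
     */
    static String classLabel(List<String> denseRow, List<String> declaredLabels) {
        String raw = denseRow.get(0);
        if (raw.equals("?") || declaredLabels.contains(raw)) {
            return raw; // missing value, or label written out in the row
        }
        return declaredLabels.get(Integer.parseInt(raw));
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList("class1", "class2", "class3");
        String[] rows = {"{1 2,3 2,4 1}", "{0 class2,3 11}", "{0 class3,1 2}"};
        for (String row : rows) {
            System.out.println(row + " -> " + classLabel(expand(row, 5), labels));
        }
    }
}
```

The first row never mentions index 0, so its class value stays at the default 0, which is the first declared label (class1). The other rows write the label out because their internal value is non-zero.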

    In general, I recommend experimenting with the Weka KnowledgeFlow tool to learn how to use the different classes. Once you know how to do things there, you can carry that knowledge over to your Java code quite easily.
    Hope I was able to help you, although the answer is a bit late.
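
On the question of fixed-size vectors and unseen test words, the general idea can be sketched in plain Java without Weka. This is an assumption-laden illustration, not how StringToWordVector is implemented: the class FixedVocabularyDemo, its tokenizer, and the alphabetical tie-breaking are all hypothetical choices (Weka, as noted above, works per class by default and may keep more words on ties). The vocabulary is built from the training documents only, and every later document is counted against that same vocabulary, so all vectors have the same length and words not in the vocabulary are dropped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: a toy bag-of-words model, not Weka's StringToWordVector.
class FixedVocabularyDemo {

    /** Lower-cases the text and splits it on runs of non-word characters. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Keeps the maxWords most frequent training tokens; ties are broken alphabetically. */
    static List<String> buildVocabulary(List<String> trainingDocs, int maxWords) {
        Map<String, Integer> freq = new HashMap<>();
        for (String doc : trainingDocs) {
            for (String tok : tokenize(doc)) {
                freq.merge(tok, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(freq.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> vocabulary = new ArrayList<>();
        for (int i = 0; i < Math.min(maxWords, entries.size()); i++) {
            vocabulary.add(entries.get(i).getKey());
        }
        return vocabulary;
    }

    /** Counts only vocabulary words, so every vector has vocabulary.size() entries. */
    static int[] vectorize(String doc, List<String> vocabulary) {
        int[] counts = new int[vocabulary.size()];
        for (String tok : tokenize(doc)) {
            int idx = vocabulary.indexOf(tok);
            if (idx >= 0) {
                counts[idx]++; // words outside the vocabulary are silently dropped
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> training = Arrays.asList("earth is easy", "earth earth moon");
        List<String> vocabulary = buildVocabulary(training, 2);
        System.out.println(vocabulary);
        // "mars" never occurred in training, so it does not affect the vector.
        System.out.println(Arrays.toString(vectorize("easy earth mars earth", vocabulary)));
    }
}
```

In this sketch a word that first appears at test time simply has no slot in the vector, which matches the "classifier ignores it" behaviour the question asks about.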
