Usually one wants to get a feature from a text by using the bag

Question


Asked: June 18, 2026


Usually, one extracts features from a text with the bag-of-words approach: counting the words and calculating different measures, for example tf-idf values, like this: How to include words as numerical feature in classification

But my problem is different: I want to extract a feature vector from a single word. For example, I want to know that potatoes and french fries are close to each other in the vector space, since they are both made of potatoes. I want to know that milk and cream are also close, as are hot and warm, stone and hard, and so on.

What is this problem called? Can I learn the similarities and features of words just by looking at a large number of documents?

I will not be implementing this for English, so I can’t use existing databases.


1 Answer

Editorial Team · Answer 1 · 2026-06-18T15:20:33+00:00

Feature extraction on text data (e.g. tf-idf) is based on statistics. You, on the other hand, are looking for sense (semantics). Therefore a method like tf-idf will not work for you.
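To make the "based on statistics" point concrete, here is a minimal sketch of tf-idf computed by hand. The function name `tf_idf`, the whitespace tokenizer, and the three toy documents are assumptions made for illustration; real libraries use slightly different weighting variants.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: tf-idf weight} dict per document (illustrative helper)."""
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # term frequency * inverse document frequency
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights

# Toy corpus built from the word pairs in the question.
docs = ["milk and cream", "hot and warm", "stone and hard"]
weights = tf_idf(docs)
```

Every weight here is derived from counts alone. In this toy corpus "milk" and "cream" receive the same weight only because they occur equally often, not because the computation knows anything about their meaning.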

NLP has three basic levels:

1. morphological analysis
2. syntactic analysis
3. semantic analysis

(A higher number means bigger problems :)). Morphology is well understood for the majority of languages. Syntactic analysis is a bigger problem (it deals with things like which word is the verb or a noun in a sentence, ...). Semantic analysis is the most challenging, since it deals with meaning, which is quite difficult to represent in machines, has many exceptions, and is language-specific.

As far as I understand, you want to know the relationships between words. This can be done via so-called dependency treebanks (or just treebanks): http://en.wikipedia.org/wiki/Treebank . A treebank is a database/graph of sentences in which a word can be considered a node and a relationship an arc. There is a good treebank for Czech, and there are some for English as well, but for many ‘less-covered’ languages it can be a problem to find one ...
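The node-and-arc picture can be sketched as a small data structure. This is only an illustration: the sentence, the relation labels, and the `build_graph` helper are hypothetical and do not follow the format of any particular treebank.

```python
from collections import defaultdict

# Hypothetical hand-written dependency parse of "She drinks warm milk".
# Each arc is (dependent, relation, head); the labels are illustrative only.
arcs = [
    ("She", "subject", "drinks"),
    ("milk", "object", "drinks"),
    ("warm", "modifier", "milk"),
]

def build_graph(arcs):
    """Map each head word (node) to its outgoing (relation, dependent) arcs."""
    graph = defaultdict(list)
    for dependent, relation, head in arcs:
        graph[head].append((relation, dependent))
    return dict(graph)

graph = build_graph(arcs)
```

Looking up `graph["milk"]` then returns the words attached to "milk" together with the kind of relationship, which is the sort of word-to-word information a treebank stores for every sentence it contains.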
