The Archive Base

Editorial Team
Asked: May 11, 2026

I have a large (~2.5M records) database of image metadata. Each record represents an image and has a unique ID, a description field, a comma-separated list of keywords (say 20-30 keywords per image), and some other fields. There’s no real database schema, and I have no way of knowing which keywords exist in the database without iterating over every image and counting them. Also, the metadata comes from several different suppliers, who each have their own ideas about how to fill out the different fields.

There are some things I would like to do with this metadata, but since I’m totally new to these kinds of algorithms I don’t even know where to begin looking.

  1. Some of these images have certain usage restrictions on them (given in text), but each supplier phrases them differently, and there is no way to guarantee consistency. I’d like to have a simple test I could apply to an image that gives an indication of whether that image is free from restrictions or not. It doesn’t have to be perfect, just ‘good enough’. I suspect I could use some kind of Bayesian filter for this, right? I could train the filter with a corpus of images that I know are either restricted or restriction-free, and then the filter would be able to make predictions for the rest of the images? Or are there better ways?
  2. I would also like to be able to index these images according to ‘keyword likeness’, so that if I have one image, I could quickly tell which other images it shares the most keywords with. Ideally, the algorithm would also take into account that some keywords are more significant than others and weigh them differently. I don’t even know where to start looking here, and would be very glad for any pointers 🙂

I’m working primarily in Java, but language choice is irrelevant here. I’m more interested in learning what approaches would be best for me to start reading up on. Thanks in advance 🙂

1 Answer

  1. Added an answer on May 11, 2026 at 6:55 am

    (1) This looks like a text classification problem, with the words in your restriction text as features and ‘Restricted’ and ‘Not Restricted’ as your labels. Bayesian filtering, or any other classification algorithm trained on a set of records you have labeled by hand, should do the trick.
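
To make the Bayesian filtering idea concrete, here is a minimal sketch of a two-class naive Bayes filter with add-one smoothing. The class name `RestrictionFilter`, its methods, and the training phrases are all hypothetical and only for illustration; a real corpus would need far more labeled examples, and the tokenizer is deliberately crude.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical two-class naive Bayes filter with add-one (Laplace) smoothing. */
final class RestrictionFilter {
    // Index 0 = restriction-free, index 1 = restricted.
    private final int[] docCount = new int[2];
    private final long[] tokenCount = new long[2];
    private final Map<String, int[]> wordCount = new HashMap<>();

    /** Lower-cases the text and splits on anything that is not a letter or digit. */
    static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
    }

    /** Adds one hand-labeled restriction text to the model. */
    void train(String text, boolean restricted) {
        int label = restricted ? 1 : 0;
        docCount[label]++;
        for (String token : tokenize(text)) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty string
            }
            wordCount.computeIfAbsent(token, k -> new int[2])[label]++;
            tokenCount[label]++;
        }
    }

    /** Log-odds that the text is restricted; a positive value means "probably restricted". */
    double logOdds(String text) {
        int vocabulary = wordCount.size();
        // Smoothed class prior.
        double score = Math.log((docCount[1] + 1.0) / (docCount[0] + 1.0));
        for (String token : tokenize(text)) {
            int[] counts = wordCount.get(token);
            if (counts == null) {
                continue; // ignore words never seen during training
            }
            double pRestricted = (counts[1] + 1.0) / (tokenCount[1] + vocabulary);
            double pFree = (counts[0] + 1.0) / (tokenCount[0] + vocabulary);
            score += Math.log(pRestricted / pFree);
        }
        return score;
    }

    boolean isProbablyRestricted(String text) {
        return logOdds(text) > 0.0;
    }

    public static void main(String[] args) {
        RestrictionFilter filter = new RestrictionFilter();
        // Made-up training phrases; real supplier wording will differ.
        filter.train("Editorial use only, no commercial use", true);
        filter.train("Not for advertising; permission required", true);
        filter.train("Royalty free, no restrictions", false);
        filter.train("Free for any use", false);
        System.out.println(filter.isProbablyRestricted("For editorial use only"));
        System.out.println(filter.isProbablyRestricted("Royalty free image"));
    }
}
```

Since the goal is only ‘good enough’, the raw log-odds may be more useful than the yes/no answer: records with a score near zero can be set aside for manual review.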

    (2) This looks like a clustering problem. First you want to come up with a good similarity function that returns a similarity score for two images based on their keywords. Cosine similarity might be a good starting point, since you are comparing sets of keywords. From there you can compute a similarity matrix and just remember a list of ‘nearest neighbors’ for each image in your dataset, or you can go further and use a clustering algorithm to come up with actual clusters of images.
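
As a rough sketch of such a similarity function, the following computes cosine similarity between two keyword sets. To let rare keywords count for more than common ones, it weights each keyword by inverse document frequency (IDF), which is an assumption added here and not something the answer prescribes. The class name `KeywordSimilarity`, its methods, and the sample keywords are hypothetical.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: IDF-weighted cosine similarity between keyword sets. */
final class KeywordSimilarity {
    /** Splits a comma-separated keyword field into trimmed, lower-cased keywords. */
    static Set<String> parseKeywords(String field) {
        Set<String> keywords = new HashSet<>();
        for (String part : field.split(",")) {
            String keyword = part.trim().toLowerCase(Locale.ROOT);
            if (!keyword.isEmpty()) {
                keywords.add(keyword);
            }
        }
        return keywords;
    }

    /** IDF weight per keyword: log(N / number of images carrying that keyword). */
    static Map<String, Double> idfWeights(Collection<Set<String>> images) {
        Map<String, Integer> documentFrequency = new HashMap<>();
        for (Set<String> keywords : images) {
            for (String keyword : keywords) {
                documentFrequency.merge(keyword, 1, Integer::sum);
            }
        }
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : documentFrequency.entrySet()) {
            idf.put(e.getKey(), Math.log((double) images.size() / e.getValue()));
        }
        return idf;
    }

    /** Cosine similarity of two keyword sets, each keyword weighted by its IDF. */
    static double cosine(Set<String> a, Set<String> b, Map<String, Double> idf) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (String keyword : a) {
            double w = idf.getOrDefault(keyword, 0.0);
            normA += w * w;
            if (b.contains(keyword)) {
                dot += w * w; // shared keyword contributes to the dot product
            }
        }
        for (String keyword : b) {
            double w = idf.getOrDefault(keyword, 0.0);
            normB += w * w;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // no informative keywords on one side
        }
        return dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        // Made-up keyword fields for four images.
        List<Set<String>> images = Arrays.asList(
                parseKeywords("beach, sunset, sea"),
                parseKeywords("beach, sunset, people"),
                parseKeywords("beach, city, people"),
                parseKeywords("beach, forest"));
        Map<String, Double> idf = idfWeights(images);
        System.out.println(cosine(images.get(0), images.get(1), idf));
        System.out.println(cosine(images.get(0), images.get(3), idf));
    }
}
```

With this weighting, a keyword that appears on every image gets weight zero, so sharing it does not make two images look alike. For 2.5M records, an inverted index from keyword to image IDs would be one way to avoid comparing every pair when building the nearest-neighbor lists.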

    Since you have so many records, you might want to skip computing the entire similarity matrix, and just compute clusters for a small, random sample of your dataset. You can then add the other data points to the appropriate clusters. If you want to preserve more similarity information you can look into soft clustering.
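
The "add the other data points to the appropriate clusters" step could look roughly like the sketch below. It assumes the sample has already been clustered by some algorithm of your choice, represents each cluster by a centroid (the fraction of its members carrying each keyword), and assigns a new image to the centroid with the highest cosine similarity. The class name `ClusterAssignment`, its methods, and the sample keywords are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: assigns images to clusters that were built from a sample. */
final class ClusterAssignment {
    /** Centroid of a cluster: for each keyword, the fraction of members that carry it. */
    static Map<String, Double> centroid(List<Set<String>> members) {
        Map<String, Double> centroid = new HashMap<>();
        for (Set<String> keywords : members) {
            for (String keyword : keywords) {
                centroid.merge(keyword, 1.0 / members.size(), Double::sum);
            }
        }
        return centroid;
    }

    /** Cosine similarity between a binary keyword set and a weighted centroid. */
    static double cosine(Set<String> image, Map<String, Double> centroid) {
        if (image.isEmpty() || centroid.isEmpty()) {
            return 0.0;
        }
        double dot = 0.0;
        double centroidNorm = 0.0;
        for (Map.Entry<String, Double> e : centroid.entrySet()) {
            centroidNorm += e.getValue() * e.getValue();
            if (image.contains(e.getKey())) {
                dot += e.getValue();
            }
        }
        // The image vector has a 1 per keyword, so its squared norm is its size.
        return dot / Math.sqrt(image.size() * centroidNorm);
    }

    /** Index of the most similar centroid, or -1 if there are no centroids. */
    static int nearestCluster(Set<String> image, List<Map<String, Double>> centroids) {
        int best = -1;
        double bestScore = -1.0;
        for (int i = 0; i < centroids.size(); i++) {
            double score = cosine(image, centroids.get(i));
            if (score > bestScore) {
                bestScore = score;
                best = i;
            }
        }
        return best;
    }

    private static Set<String> keywords(String... words) {
        return new HashSet<>(Arrays.asList(words));
    }

    public static void main(String[] args) {
        // Two made-up clusters, as if produced by clustering a small sample.
        List<Map<String, Double>> centroids = new ArrayList<>();
        centroids.add(centroid(Arrays.asList(keywords("beach", "sea"), keywords("beach", "sunset"))));
        centroids.add(centroid(Arrays.asList(keywords("city", "street"), keywords("city", "night"))));
        System.out.println(nearestCluster(keywords("beach", "sunset", "people"), centroids));
        System.out.println(nearestCluster(keywords("city", "people"), centroids));
    }
}
```

Keeping the per-cluster scores from `cosine` instead of only the best index gives a crude form of the soft clustering mentioned above, since each image then carries a degree of membership in every cluster.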

    Hopefully that will get you started.


© 2021 The Archive Base. All Rights Reserved