
The Archive Base

Editorial Team
Asked: May 31, 2026 (2026-05-31T21:07:03+00:00)

I need to perform calculations and manipulation on an extremely large table or matrix,

I need to perform calculations and manipulation on an extremely large table or matrix, which will have roughly 7500 rows and 30000 columns.

The matrix data will look like this:

Document ID | word 1 | word 2 | word 3 | … | word 30000 | Document Class
0032        | 1      | 0      | 0      | … | 1          | P

In other words, the vast majority of the cells will contain boolean values (0s and 1s).

The calculations that need to be done would use word stemming or feature selection (reducing the number of words by using reduction techniques), as well as per-class or per-word calculations, etc.

What I have in mind is designing an OOP model for representing the matrix, and then serializing the objects to disk so I can reuse them later on. For instance, I would have an object for each row or each column, or perhaps an object for each intersection that is contained within another class.

I have thought about representing it in XML, but file sizes may prove problematic.

I may be missing the mark with my approach here.
Am I on the right path, or are there better-performing approaches to manipulating such large data collections?

Key issues here will be performance (reaction time, etc.), as well as redundancy and integrity of the data, and obviously I would need to save the data on disk.

1 Answer

  1. Editorial Team
    Added an answer on May 31, 2026 at 9:07 pm

    You haven’t explained the nature of the calculations you need to do on the table/matrix, so I’m making assumptions there, but if I read your question correctly, this may be a poster-child case for a relational database, even if you don’t have any actual relations in your database. If you can’t use a full server, use SQL Server Compact Edition as an embedded database, which would let you manage the .SDF file programmatically if you chose.

    Edit:
    After a second consideration, I withdraw my suggestion for a database. This is entirely because of the number of columns in the table: any relational database you use will have hard limits on this, and I don’t see a way around it that isn’t amazingly complicated.

    Based on your edit, I would say that there are three things you are interested in:

    1. A way to analyze the presence of words in documents. This is the bulk of your sample data file, primarily being boolean values indicating the presence or lack of a word in a document.
    2. The words themselves. This is primarily contained in the first row of your sample data file.
    3. A means of identifying documents and their classification. These are the first and last columns of your data file.

    After thinking about it for a little bit, this is how I would model your data:

    1. With the case of word presence, I feel it’s best to avoid a complex object model. You’re wanting to do pure calculation in both directions (by column and by row), and the most flexible and potentially performant structure for that in my opinion is a simple two-dimensional array of bool fields, like so:

      var wordMatrix = new bool[numDocuments,numWords];

    2. The words themselves should be in an array or list of strings that is index-linked to the second dimension of the word matrix (the one defined by numWords in the example above). If you ever need to find a particular word quickly, you could use a Dictionary<string, int>, with the word as the key and its index as the value.

    3. The document identification would similarly be in an array or list of ints index-linked to the first dimension (the one defined by numDocuments). I’m assuming the document ids are integer values there. The classification would be a similar array or list, although I’d use a list of enums representing each possible value of the classification. As with the word search, if you need to search for documents by id, you could have a Dictionary<int, int> act as your search index, mapping each document id to its row index.
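
    The three pieces above could fit together roughly as follows. This is only a sketch under stated assumptions: the names WordPresenceModel, SetPresence, DocumentFrequency and WordCount are made up for illustration, and the enum value N is hypothetical, since the question only shows class "P".

```csharp
using System;
using System.Collections.Generic;

// Hypothetical enum: only "P" appears in the question; "N" is assumed for illustration.
public enum DocumentClass { P, N }

public sealed class WordPresenceModel
{
    public bool[,] WordMatrix { get; }        // [document, word]
    public string[] Words { get; }            // index-linked to the second dimension
    public int[] DocumentIds { get; }         // index-linked to the first dimension
    public DocumentClass[] Classes { get; }   // index-linked to the first dimension
    public Dictionary<string, int> WordIndex { get; } = new Dictionary<string, int>();
    public Dictionary<int, int> DocumentIndex { get; } = new Dictionary<int, int>();

    public WordPresenceModel(int[] documentIds, DocumentClass[] classes, string[] words)
    {
        if (documentIds.Length != classes.Length)
            throw new ArgumentException("One class is required per document.");
        DocumentIds = documentIds;
        Classes = classes;
        Words = words;
        WordMatrix = new bool[documentIds.Length, words.Length];
        // Build the lookup indexes once so searches by word or id avoid a linear scan.
        for (int w = 0; w < words.Length; w++) WordIndex[words[w]] = w;
        for (int d = 0; d < documentIds.Length; d++) DocumentIndex[documentIds[d]] = d;
    }

    public void SetPresence(int documentId, string word, bool present)
        => WordMatrix[DocumentIndex[documentId], WordIndex[word]] = present;

    // Per-word calculation: how many documents contain the word (walks one column).
    public int DocumentFrequency(string word)
    {
        int w = WordIndex[word], count = 0;
        for (int d = 0; d < DocumentIds.Length; d++)
            if (WordMatrix[d, w]) count++;
        return count;
    }

    // Per-class calculation: how many documents of one class contain the word.
    public int DocumentFrequency(string word, DocumentClass cls)
    {
        int w = WordIndex[word], count = 0;
        for (int d = 0; d < DocumentIds.Length; d++)
            if (Classes[d] == cls && WordMatrix[d, w]) count++;
        return count;
    }

    // Per-document calculation: how many distinct words are present (walks one row).
    public int WordCount(int documentId)
    {
        int d = DocumentIndex[documentId], count = 0;
        for (int w = 0; w < Words.Length; w++)
            if (WordMatrix[d, w]) count++;
        return count;
    }
}

public static class MatrixDemo
{
    public static void Main()
    {
        var model = new WordPresenceModel(
            new[] { 32, 33 },
            new[] { DocumentClass.P, DocumentClass.N },
            new[] { "alpha", "beta", "gamma" });
        model.SetPresence(32, "alpha", true);
        model.SetPresence(32, "gamma", true);
        model.SetPresence(33, "alpha", true);

        Console.WriteLine(model.DocumentFrequency("alpha"));                  // 2
        Console.WriteLine(model.DocumentFrequency("alpha", DocumentClass.P)); // 1
        Console.WriteLine(model.WordCount(32));                               // 2
    }
}
```

    Keeping the presence data in one flat array means both the column walk (per word) and the row walk (per document) are plain loops, with the dictionaries used only to translate a word or document id into an index.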

    I’ve made several assumptions with this model, particularly that you want to do pure calculation on the word presence in all directions rather than “per document”. If I’m wrong, a simpler approach might be to drop the two-dimensional array and model by document, i.e. a single C# Document class with DocumentId and DocumentClassification fields as well as a simple array of booleans that is index-linked to the word list. You could then work with a list of these Document objects along with a separate list of words.
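
    A minimal sketch of that per-document alternative might look like this. The helper class DocumentQueries and its method names are hypothetical, as is the enum value N; only the Document class with an id, a classification and a boolean array comes from the description above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical enum: only "P" appears in the question; "N" is assumed for illustration.
public enum DocumentClass { P, N }

public sealed class Document
{
    public int DocumentId { get; }
    public DocumentClass DocumentClassification { get; }
    public bool[] WordPresence { get; }   // index-linked to the shared word list

    public Document(int documentId, DocumentClass classification, int wordCount)
    {
        DocumentId = documentId;
        DocumentClassification = classification;
        WordPresence = new bool[wordCount];
    }
}

// Hypothetical helper showing per-word and per-class calculations over a document list.
public static class DocumentQueries
{
    // Number of documents that contain the word at wordIndex.
    public static int DocumentFrequency(IEnumerable<Document> documents, int wordIndex)
        => documents.Count(d => d.WordPresence[wordIndex]);

    // The same count, broken down by classification.
    public static Dictionary<DocumentClass, int> FrequencyByClass(
        IEnumerable<Document> documents, int wordIndex)
        => documents.Where(d => d.WordPresence[wordIndex])
                    .GroupBy(d => d.DocumentClassification)
                    .ToDictionary(g => g.Key, g => g.Count());
}

public static class DocumentDemo
{
    public static void Main()
    {
        var words = new List<string> { "alpha", "beta" };
        var documents = new List<Document>
        {
            new Document(32, DocumentClass.P, words.Count),
            new Document(33, DocumentClass.N, words.Count),
            new Document(34, DocumentClass.P, words.Count),
        };
        documents[0].WordPresence[0] = true;   // document 32 contains "alpha"
        documents[1].WordPresence[0] = true;   // document 33 contains "alpha"
        documents[1].WordPresence[1] = true;   // document 33 contains "beta"

        Console.WriteLine(DocumentQueries.DocumentFrequency(documents, 0)); // 2
        var byClass = DocumentQueries.FrequencyByClass(documents, 0);
        Console.WriteLine(byClass[DocumentClass.P]);                        // 1
    }
}
```

    Per-document work stays simple with this shape, while per-word work has to visit every Document object, which is the trade-off against the two-dimensional array.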

    Once you have a data model you like, saving it to disk is the easiest part. Just use C# serialization. You can save it as XML or binary, your choice. Binary would give you the smallest file size, naturally: 7,500 × 30,000 = 225 million cells at one byte per bool is a little more than 200MB, plus the size of a list of 30000 words. If you include the Dictionary lookup indexes, perhaps an additional 120kB.
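
    As one possible way to do the binary save, here is a sketch that writes the matrix by hand with BinaryWriter instead of a general-purpose serializer. That choice is an assumption on my part (the built-in BinaryFormatter is marked obsolete in recent .NET versions), and the class name WordMatrixStorage and the file layout (two int dimensions followed by one byte per cell) are made up for illustration.

```csharp
using System;
using System.IO;

// Hypothetical helper: stores a bool[,] as two Int32 dimensions followed by one byte per cell.
public static class WordMatrixStorage
{
    public static void Save(string path, bool[,] matrix)
    {
        int rows = matrix.GetLength(0);
        int cols = matrix.GetLength(1);
        using (var writer = new BinaryWriter(File.Create(path)))
        {
            writer.Write(rows);
            writer.Write(cols);
            for (int r = 0; r < rows; r++)
                for (int c = 0; c < cols; c++)
                    writer.Write(matrix[r, c]);   // BinaryWriter writes a bool as a single byte
        }
    }

    public static bool[,] Load(string path)
    {
        using (var reader = new BinaryReader(File.OpenRead(path)))
        {
            int rows = reader.ReadInt32();
            int cols = reader.ReadInt32();
            var matrix = new bool[rows, cols];
            for (int r = 0; r < rows; r++)
                for (int c = 0; c < cols; c++)
                    matrix[r, c] = reader.ReadBoolean();
            return matrix;
        }
    }
}

public static class SerializationDemo
{
    public static void Main()
    {
        var matrix = new bool[2, 3];
        matrix[0, 0] = true;
        matrix[1, 2] = true;

        string path = Path.GetTempFileName();
        try
        {
            WordMatrixStorage.Save(path, matrix);
            bool[,] loaded = WordMatrixStorage.Load(path);
            Console.WriteLine(loaded[0, 0]);                 // True
            Console.WriteLine(loaded[1, 2]);                 // True
            Console.WriteLine(new FileInfo(path).Length);    // 14 (8 header bytes + 6 cells)
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

    With this layout the full 7500 x 30000 matrix would take 225,000,008 bytes on disk, which lines up with the estimate above. The word list, document ids and classifications would need to be written separately.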
