I have a Solr database that needs a new field containing a list of strings. They are somewhat like tags, except that they are predefined and used for an internal purpose. The search results from this Solr core will go across the public Internet to third-party website developers. I therefore want to obfuscate the tags and make it impossible for someone to guess a tag that would reveal information about another customer.
I could easily accomplish this using GUIDs, but I wonder what the memory impact will be of having hundreds of thousands of records in RAM, each with a field containing an array of several GUIDs.
If the GUIDs were recorded as atoms, i.e. one copy of each GUID and many references to it, then this would be a non-issue. But I cannot find out whether Solr or Lucene uses atoms in its in-RAM data structures. The disk storage is not an issue.
This is similar to dedup issues, but my research shows that people are mostly concerned with whole duplicate documents, not with individual fields.
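To make clear what I mean by atoms, here is a minimal Python sketch. It is not Solr or Lucene code and says nothing about how either stores fields internally. The function names (`make_tag_pool`, `build_records`, `count_string_objects`) and the sizes are made up for illustration. It builds the same records twice, once with a private copy of every GUID string and once with a single shared copy per distinct GUID, and counts how many string objects end up in memory.

```python
import uuid


def make_tag_pool(n_tags):
    """Create the predefined set of obfuscated tags as random GUID strings."""
    return [str(uuid.uuid4()) for _ in range(n_tags)]


def build_records(tag_pool, n_records, tags_per_record, atomize):
    """Build records, each holding a list of GUID tag strings.

    atomize=False: every record keeps its own freshly parsed string.
    atomize=True:  every record references one canonical string per GUID.
    """
    atoms = {}  # GUID text -> the single canonical string object
    records = []
    for i in range(n_records):
        tags = []
        for j in range(tags_per_record):
            source = tag_pool[(i + j) % len(tag_pool)]
            # Re-parse the GUID to get a brand-new string object,
            # as if it had just been read from an incoming document.
            fresh = str(uuid.UUID(source))
            if atomize:
                # Keep the first copy seen and reuse it afterwards.
                fresh = atoms.setdefault(fresh, fresh)
            tags.append(fresh)
        records.append(tags)
    return records


def count_string_objects(records):
    """Count distinct string objects (by identity) held by the records."""
    return len({id(tag) for tags in records for tag in tags})


if __name__ == "__main__":
    pool = make_tag_pool(50)
    naive = build_records(pool, 1000, 3, atomize=False)
    shared = build_records(pool, 1000, 3, atomize=True)
    print("string objects without atoms:", count_string_objects(naive))
    print("string objects with atoms:   ", count_string_objects(shared))
```

Both variants hold identical tag values. Without atoms the number of string objects grows with records times tags per record, while with atoms it is bounded by the number of distinct GUIDs. That bound is the property I am hoping the in-RAM structures have.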
There are two indexes: