The Archive Base

Editorial Team
Asked: May 25, 2026 (2026-05-25T22:56:49+00:00)

I have two large data frames, a and b for which identical(a,b) is TRUE

I have two large data frames, a and b for which identical(a,b) is TRUE, as is all.equal(a,b), but identical(digest(a),digest(b)) is FALSE. What could cause this?

What’s more, I tried digging deeper by applying digest to blocks of rows. Surprisingly, at least to me, the digest values agree on sub-frames all the way through the last row of the data frames.

Here is a sequence of comparisons:

> identical(a, b)
[1] TRUE
> all.equal(a, b)
[1] TRUE
> digest(a)
[1] "cac56b06078733b6fb520442e5482684"
> digest(b)
[1] "fdd5ab78ca961982d195f800e3cf60af"
> digest(a[1:nrow(a),])
[1] "e44f906723405756509a6b17b5949d1a"
> digest(b[1:nrow(b),])
[1] "e44f906723405756509a6b17b5949d1a"

Every method I can think of indicates these two objects are identical, but their digest values are different. Is there something else about data frames that can produce such discrepancies?


For further details: the objects are about 10M rows x 12 columns. Here’s the output of str():

'data.frame':   10056987 obs. of  12 variables:
 $ V1 : num  1 11 21 31 41 61 71 81 91 101 ...
 $ V2 : num  1 1 1 1 1 1 1 1 1 1 ...
 $ V3 : num  2 3 2 3 4 5 2 4 2 4 ...
 $ V4 : num  1 1 1 1 1 1 1 1 1 1 ...
 $ V5 : num  1.8 2.29 1.94 2.81 3.06 ...
 $ V6 : num  0.0653 0.0476 0.0324 0.034 0.0257 ...
 $ V7 : num  0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
 $ V8 : num  0.00653 0.00476 0.00324 0.0034 0.00257 ...
 $ V9 : num  1.8 2.3 1.94 2.81 3.06 ...
 $ V10: num  0.1957 0.7021 0.0604 0.1866 0.9371 ...
 $ V11: num  1704 1554 1409 1059 1003 ...
 $ V12: num  23309 23309 23309 23309 23309 ...

> print(object.size(a), units = "Mb")
920.7 Mb

Update 1: On a whim, I converted these to matrices. The digests are the same.

> aM = as.matrix(a)
> bM= as.matrix(b)
> identical(aM,bM)
[1] TRUE
> digest(aM)
[1] "c5147d459ba385ca8f30dcd43760fc90"
> digest(bM)
[1] "c5147d459ba385ca8f30dcd43760fc90"

I then tried converting back to a data frame, and the digest values are equal (and equal to the previous value for a).

> aMF = as.data.frame(aM)
> bMF = as.data.frame(bM)
> digest(aMF)
[1] "cac56b06078733b6fb520442e5482684"
> digest(bMF)
[1] "cac56b06078733b6fb520442e5482684"
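
One way to read this result is that the matrix round trip throws away the data frame's attributes and rebuilds them from scratch, so both copies end up stored the same way. The sketch below uses small hypothetical data frames and assumes the only difference between them is the order in which the attributes are stored. It is only reasonable for all-numeric frames, because as.matrix() coerces mixed column types to character.

```r
# Toy stand-ins for the real objects (hypothetical data, all numeric).
base_df <- data.frame(V1 = c(1, 11, 21, 31, 41),
                      V5 = c(1.8, 2.29, 1.94, 2.81, 3.06))
at <- attributes(base_df)

# Same content, attributes stored in two different orders (assumed scenario).
a <- base_df
attributes(a) <- at[c("names", "row.names", "class")]
b <- base_df
attributes(b) <- at[c("names", "class", "row.names")]

# Guard: the round trip would silently change non-numeric columns.
stopifnot(all(vapply(b, is.numeric, logical(1))))

# Rebuild both data frames through a matrix.
aMF <- as.data.frame(as.matrix(a))
bMF <- as.data.frame(as.matrix(b))

# Strict comparison, attribute order included.
identical(aMF, bMF, attrib.as.set = FALSE)
```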

So, b looks like the bad boy, and it has a colorful past. b came from a much bigger data frame, say B. I took only the columns of B that appeared in a and checked to see if they were equal. Well, they were equal, but had different digests. I converted the column names (from “InformativeColumnName1” to “V1”, etc.), just to avoid any issues that might arise – though all.equal and identical tend to point out when column names differ.

Since I am working on two different programs and don’t have simultaneous access to a and b, it is easiest for me to use the digest values to check the calculations. However, something seems to be odd in how I extract columns from a data frame and then apply digest() to it.


ANSWER:
It turns out, to my astonishment (dismay, horror, embarrassment, you name it), that identical is forgiving about attributes: by default it ignores the order in which they are stored. I had assumed that only all.equal was forgiving about attributes.

This was discovered via Tommy’s suggestion, identical(d1, d2, attrib.as.set=FALSE). Running attributes(a) is a bad, bad idea: the deluge of row names took a while before Ctrl-C could interrupt it. Here is the output of names(attributes(x)) for each object:

> names(attributes(a))
[1] "names"     "row.names" "class"    
> names(attributes(b))
[1] "names"     "class"     "row.names"

They’re in different orders! Kudos to digest() for being straight with me.
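
A small reproduction may make the mechanism easier to see. This is a sketch with hypothetical toy data, not the original objects. It assumes that digest() hashes the serialized form of an object (its default behaviour), so base R's serialize() stands in for it here.

```r
# Toy stand-ins for the real objects (hypothetical data).
base_df <- data.frame(V1 = c(1, 11, 21, 31, 41), V2 = c(1, 1, 1, 1, 1))
at <- attributes(base_df)

# Same attributes, stored in the two orders seen above.
a <- base_df
attributes(a) <- at[c("names", "row.names", "class")]
b <- base_df
attributes(b) <- at[c("names", "class", "row.names")]

identical(a, b)                         # default: attribute order is ignored
identical(a, b, attrib.as.set = FALSE)  # strict: attribute order is compared

# The byte streams that a serialization-based hash would see.
identical(serialize(a, NULL), serialize(b, NULL))

# Cheap diagnostic that avoids printing millions of row names.
rbind(a = names(attributes(a)), b = names(attributes(b)))
```

Comparing names(attributes(x)) rather than attributes(x) keeps the check fast even on very large data frames.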

UPDATE

To help others with this problem: simply rearranging the attributes into a common order appears to be enough to get identical hash values. Tinkering with attribute order is new to me, so this may break something, but it works in my case. Note that it is somewhat time-consuming when the objects are big; I’m not aware of a faster method. (I’m also looking to move to matrices or data tables instead of data frames, and this may be another incentive to avoid data frames.)

tmpA0   = attributes(a)
tmpA1   = tmpA0[sort(names(tmpA0))]
a2      = a
attributes(a2) = tmpA1

tmpB0   = attributes(b)
tmpB1   = tmpB0[sort(names(tmpB0))]
b2      = b
attributes(b2) = tmpB1

digest(a2)  # e04e624692d82353479efbd713ec03f6
digest(b2)  # e04e624692d82353479efbd713ec03f6

identical(b,b2, attrib.as.set = FALSE) # FALSE
identical(b,b2, attrib.as.set = TRUE) # TRUE
identical(a2,b2, attrib.as.set = FALSE) # TRUE
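
The same steps could be wrapped in a small helper so that both programs normalize objects the same way before hashing. This is a sketch only: normalize_attr_order is a hypothetical name, the toy data frames are illustrative, and method = "radix" is used on the assumption that a locale-independent sort order is wanted when the two programs run in different environments.

```r
# Hypothetical helper: store an object's attributes in sorted-name order.
normalize_attr_order <- function(x) {
  at <- attributes(x)
  if (is.null(at)) return(x)  # nothing to reorder
  # Radix sorting of character vectors does not depend on the locale.
  attributes(x) <- at[sort(names(at), method = "radix")]
  x
}

# Toy stand-ins whose attributes are stored in different orders.
base_df <- data.frame(V1 = c(1, 11, 21, 31, 41), V2 = c(1, 1, 1, 1, 1))
at <- attributes(base_df)
a <- base_df
attributes(a) <- at[c("names", "row.names", "class")]
b <- base_df
attributes(b) <- at[c("names", "class", "row.names")]

a2 <- normalize_attr_order(a)
b2 <- normalize_attr_order(b)

identical(a2, b2, attrib.as.set = FALSE)             # strict comparison
identical(serialize(a2, NULL), serialize(b2, NULL))  # bytes a hash would see
```

If an object has a dim attribute, R assigns it before the others, so the stored order may not be strictly alphabetical there, but it should still be the same for both objects.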

1 Answer

  1. Editorial Team
    Added an answer on May 25, 2026 at 10:56 pm

    Without having the actual data.frames it is of course hard to know, but one difference could be the order of the attributes. identical ignores that by default, but setting attrib.as.set=FALSE can change that:

    d1 <- structure(1, foo=1, bar=2)
    d2 <- structure(1, bar=2, foo=1)
    
    identical(d1, d2) # TRUE
    identical(d1, d2, attrib.as.set=FALSE) # FALSE
    