Editorial Team
Asked: May 28, 2026


I recently came across the pandas library for Python, which according to this benchmark performs very fast in-memory merges. It’s even faster than the data.table package in R (my language of choice for analysis).

Why is pandas so much faster than data.table? Is it because of an inherent speed advantage Python has over R, or is there some tradeoff I’m not aware of? Is there a way to perform inner and outer joins in data.table without resorting to merge(X, Y, all=FALSE) and merge(X, Y, all=TRUE)?

Comparison

Here’s the R code and the Python code used to benchmark the various packages.

1 Answer

    Editorial Team
    Added an answer on May 28, 2026 at 3:34 pm

    It looks like Wes may have discovered a known issue in data.table when the number of unique strings (levels) is large: 10,000.

    Does Rprof() reveal most of the time spent in the call sortedmatch(levels(i[[lc]]), levels(x[[rc]]))? This isn’t really the join itself (the algorithm), but a preliminary step.

    Recent efforts have gone into allowing character columns in keys, which should resolve that issue by integrating more closely with R’s own global string hash table. Some benchmark results are already reported by test.data.table() but that code isn’t hooked up yet to replace the levels to levels match.

    Are pandas merges faster than data.table for regular integer columns? That should be a way to isolate the algorithm itself vs factor issues.

    Also, data.table has time series merge in mind. Two aspects to that: i) multi column ordered keys such as (id,datetime) ii) fast prevailing join (roll=TRUE) a.k.a. last observation carried forward.
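
To make the "prevailing join" idea concrete, here is a minimal sketch in plain Python (not data.table code). It assumes rows are kept sorted by an (id, time) key and uses binary search to find the last observation at or before each lookup time. The function name `rolling_join` and the sample data are hypothetical, chosen only for illustration.

```python
from bisect import bisect_right

def rolling_join(x_rows, i_rows):
    """For each (id, time) in i_rows, return the value of the last x row
    with the same id and a time <= the lookup time (None if there is none).

    x_rows: list of (id, time, value) tuples, already sorted by (id, time).
    """
    keys = [(rid, t) for rid, t, _ in x_rows]  # the ordered two-column key
    out = []
    for rid, t in i_rows:
        # Index of the first key strictly greater than (rid, t).
        pos = bisect_right(keys, (rid, t))
        if pos > 0 and keys[pos - 1][0] == rid:
            out.append(x_rows[pos - 1][2])  # last observation carried forward
        else:
            out.append(None)  # no earlier observation for this id
    return out

# Hypothetical sample data: (id, time, value), sorted by (id, time).
quotes = [("A", 1, 10.0), ("A", 5, 11.0), ("B", 2, 20.0)]
lookups = [("A", 3), ("A", 5), ("A", 0), ("B", 9), ("C", 1)]
result = rolling_join(quotes, lookups)
print(result)
```

The lookup relies on the data being physically ordered by the key, which is the property an ordered key provides and a plain hash table does not.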

    I’ll need some time to confirm as it’s the first I’ve seen of the comparison to data.table as presented.


    UPDATE from data.table v1.8.0 released July 2012

    • Internal function sortedmatch() removed and replaced with chmatch()
      when matching i levels to x levels for columns of type ‘factor’. This
      preliminary step was causing a (known) significant slowdown when the number
      of levels of a factor column was large (e.g. >10,000). Exacerbated in
      tests of joining four such columns, as demonstrated by Wes McKinney
      (author of Python package Pandas). Matching 1 million strings of which
      600,000 are unique is now reduced from 16s to 0.5s, for example.
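
As a rough illustration of why joining factor columns needs a levels-to-levels match first, the sketch below encodes strings as integer codes plus a table of levels, then translates one column's codes into the other's code space. It is a simplified model in plain Python, not data.table's internals; `to_factor` and `remap_codes` are hypothetical names, and codes are 0-based here whereas R's are 1-based.

```python
def to_factor(strings):
    """Encode strings as (sorted unique levels, integer codes), loosely like an R factor."""
    levels = sorted(set(strings))
    index = {s: k for k, s in enumerate(levels)}
    return levels, [index[s] for s in strings]

def remap_codes(i_levels, i_codes, x_levels):
    """Translate i's codes into x's code space; None where a level is absent from x."""
    x_index = {s: k for k, s in enumerate(x_levels)}
    # The levels-to-levels match: one lookup per level of i, not per row.
    level_map = [x_index.get(s) for s in i_levels]
    return [level_map[c] for c in i_codes]

x_levels, x_codes = to_factor(["pear", "apple", "fig", "apple"])
i_levels, i_codes = to_factor(["fig", "kiwi", "apple"])
remapped = remap_codes(i_levels, i_codes, x_levels)
print(x_levels, x_codes, remapped)
```

The work in this step grows with the number of levels rather than the number of rows, which is consistent with the slowdown showing up only when a factor column has many levels.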

    Also in that release:

    • character columns are now allowed in keys and are preferred to
      factor. data.table() and setkey() no longer coerce character to
      factor. Factors are still supported. Implements FR#1493, FR#1224
      and (partially) FR#951.

    • New functions chmatch() and %chin%, faster versions of match()
      and %in% for character vectors. R’s internal string cache is
      utilised (no hash table is built). They are about 4 times faster
      than match() on the example in ?chmatch.
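
For readers unfamiliar with R, the following plain-Python sketch shows only the result contract of match() and %in% that chmatch() and %chin% speed up: the position of the first occurrence of each string, and a membership flag. It is an assumption-laden illustration, not how chmatch() works; in particular it builds a dict, whereas the NEWS entry above says chmatch() builds no hash table. `None` stands in for R's NA, and the names `match` and `is_in` are hypothetical.

```python
def match(x, table):
    """1-based position (as in R) of the first occurrence of each x in table; None if absent."""
    first_pos = {}
    for pos, s in enumerate(table, start=1):
        first_pos.setdefault(s, pos)  # keep only the first occurrence
    return [first_pos.get(s) for s in x]

def is_in(x, table):
    """Membership flag for each element of x, mirroring the contract of %in%."""
    members = set(table)
    return [s in members for s in x]

positions = match(["b", "z", "a"], ["a", "b", "a", "c"])
flags = is_in(["b", "z", "a"], ["a", "b", "a", "c"])
print(positions, flags)
```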

    As of Sep 2013 data.table is v1.8.10 on CRAN and we’re working on v1.9.0. NEWS is updated live.


    But as I wrote originally, above:

    data.table has time series merge in mind. Two aspects to that: i)
    multi column ordered keys such as (id,datetime) ii) fast prevailing
    join (roll=TRUE) a.k.a. last observation carried forward.

    So the Pandas equi join of two character columns is probably still faster than data.table, since it sounds like it hashes the combined two columns. data.table doesn’t hash the key because it has prevailing ordered joins in mind. A “key” in data.table is literally just the sort order (similar to a clustered index in SQL; i.e., that’s how the data is ordered in RAM). Adding secondary keys, for example, is on the to-do list.
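
The tradeoff between hashing a combined key and keeping data sorted by key can be sketched in plain Python. This is a toy comparison under stated assumptions, not the actual pandas or data.table implementation: whether pandas hashes the combined columns is only a guess in the text above, and `hash_equi_join` and `sorted_key_equi_join` are hypothetical names. Keys are tuples of two strings.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

def hash_equi_join(left, right):
    """Inner equi join: build a hash table on right keys, then probe with left keys."""
    table = defaultdict(list)
    for key, val in right:
        table[key].append(val)
    return [(key, lval, rval) for key, lval in left for rval in table.get(key, [])]

def sorted_key_equi_join(left, right_sorted):
    """Inner equi join against rows already sorted by key, using binary search."""
    keys = [key for key, _ in right_sorted]
    out = []
    for key, lval in left:
        # All right rows with this key sit in the contiguous range [lo, hi).
        lo, hi = bisect_left(keys, key), bisect_right(keys, key)
        out.extend((key, lval, right_sorted[j][1]) for j in range(lo, hi))
    return out

# Hypothetical sample data with two-column string keys.
left = [(("a", "x"), 1), (("b", "y"), 2), (("c", "z"), 3)]
right = [(("b", "y"), 20), (("a", "x"), 10), (("a", "x"), 11)]

by_hash = hash_equi_join(left, right)
by_sort = sorted_key_equi_join(left, sorted(right))
print(by_hash == by_sort)
```

Both give the same rows for an exact-match join. The hash probe takes expected constant time per lookup, while the sorted layout costs a binary search per lookup but also supports ordered lookups such as "last row at or before this key".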

    In summary, the glaring speed difference highlighted by this particular two-character-column test with over 10,000 unique strings shouldn’t be as bad now, since the known problem has been fixed.
