The Archive Base

Editorial Team
Asked: June 15, 2026

Consider the following program:

#include <pthread.h>

static int final_value = 0;

#ifdef TLS_VAR
static int __thread tls_var;
#else
static int tls_var;
#endif

void  __attribute__ ((noinline)) modify_tls(void) {
  tls_var++;
}

void *thread_function(void *unused) {
  const int iteration_count = 1 << 25;

  tls_var = 0;
  for (int i = 0; i < iteration_count; i++) {
    modify_tls();
  }
  final_value += tls_var;
  return NULL;
}

int main() {
  const int thread_count = 1 << 7;

  pthread_t thread_ids[thread_count];
  for (int i = 0; i < thread_count; i++) {
    pthread_create(&thread_ids[i], NULL, thread_function, NULL);
  }

  for (int i = 0; i < thread_count; i++) {
    pthread_join(thread_ids[i], NULL);
  }

  return 0;
}

On my i7, it takes 1.308 seconds to execute with TLS_VAR defined and
8.392 seconds with it undefined; and I am unable to account for such a huge
difference.

The assembly for modify_tls looks like this (I’ve only mentioned the
parts that are different):

;; !defined(TLS_VAR)
movl tls_var(%rip), %eax
addl $1, %eax
movl %eax, tls_var(%rip)

;; defined(TLS_VAR)
movl %fs:tls_var@tpoff, %eax
addl $1, %eax
movl %eax, %fs:tls_var@tpoff

The TLS lookup is understandable, with a load from the TCB. But why
is the tls_var load in the first case relative to %rip? Why can’t
it be a direct memory address which gets relocated by the loader? Is
this %rip relative load responsible for the slowness? If so, why?

Compile flags: gcc -O3 -std=c99 -Wall -Werror -lpthread

1 Answer

Editorial Team, added an answer on June 15, 2026 at 3:08 pm

    Without the __thread attribute, tls_var is simply a shared variable. Whenever one thread writes to it, the write first goes to the cache of the core on which that thread executes. Because x86 machines are cache coherent, that write invalidates the copies of the cache line held by the other cores, and each of them must fetch the line again before its next access, either from the last-level cache or from main memory (in your case most likely from the last-level cache, which is the shared L3 cache on a Core i7). With every thread incrementing the same variable, the line bounces between cores on nearly every increment. Although faster than main memory, the last-level cache is not infinitely fast: it still takes many cycles to move data from there into the L2 and L1 caches, which are private to each core.

    With the __thread attribute, each thread gets its own copy of tls_var, located in that thread's thread-local storage. Since the TLS blocks of different threads lie far apart in memory, the copies do not share a cache line, so modifying them generates no cache coherency traffic and the data stays in the fastest cache, the L1.
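To make the "own copy" point concrete, here is a minimal sketch. The helper `run_tls_demo`, the variable `tls_counter`, and the constants 7 and 1000 are made up for illustration; only `__thread` and the pthread calls come from the question. It assumes GCC or Clang, since `__thread` is a compiler extension.

```c
#include <assert.h>
#include <pthread.h>

/* One instance per thread; __thread is the GCC/Clang extension the question uses. */
static __thread int tls_counter;

/* Each worker increments its own copy and reports the value it ended with. */
static void *tls_worker(void *arg) {
  int *result = arg;
  for (int i = 0; i < 1000; i++) {
    tls_counter++;
  }
  *result = tls_counter;
  return NULL;
}

/* Hypothetical helper: runs two workers, stores what each one saw in
   worker_values, and returns the calling thread's own copy. */
int run_tls_demo(int worker_values[2]) {
  pthread_t threads[2];

  tls_counter = 7; /* written only to the calling thread's copy */
  for (int i = 0; i < 2; i++) {
    int rc = pthread_create(&threads[i], NULL, tls_worker, &worker_values[i]);
    assert(rc == 0);
    (void)rc;
  }
  for (int i = 0; i < 2; i++) {
    pthread_join(threads[i], NULL);
  }
  return tls_counter; /* untouched by the workers */
}
```

Each worker starts from a zero-initialized copy and counts to 1000 on its own, while the caller's copy keeps the value it stored before the workers ran.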

    RIP-relative addressing (the default addressing mode that the System V ABI for x86-64 recommends for “near” data) is not what makes the first version slow; on its own it usually gives access that is at least as fast as the %fs-relative TLS access. The cache coherency overhead is so large, however, that the TLS version ends up faster once everything stays in the L1 cache.

    This problem is hugely magnified on NUMA systems, e.g. on multiprocessor (post-)Nehalem or AMD64 boards. Not only is it much more expensive to keep the caches coherent, but the shared variable also resides in the memory attached to the socket on which the thread that first “touched” it was running. Threads that run on cores of other sockets then have to perform remote memory accesses over the QPI or HyperTransport link that connects the sockets. As one visiting professor said recently (a rough paraphrase): “Program shared memory systems as if they are distributed memory systems.” This involves making local copies of global data to work on, which is exactly what the __thread attribute achieves.
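The "local copy" advice can also be followed without TLS. The sketch below is one possible shape, not the asker's code: each thread counts in a local variable and merges into the shared total exactly once. The names `shared_total`, `local_then_merge`, and `run_local_then_merge`, as well as the worker and iteration counts, are hypothetical, and `<stdatomic.h>` assumes a C11 compiler rather than the `-std=c99` from the question.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

enum { WORKER_COUNT = 4, ITERATIONS = 1 << 20 };

/* The only location that every thread touches. */
static atomic_long shared_total;

static void *local_then_merge(void *unused) {
  (void)unused;
  long local = 0; /* private to this thread: register or own stack */
  for (long i = 0; i < ITERATIONS; i++) {
    local++; /* an optimizer may fold this loop; real work would go here */
  }
  /* One synchronized write per thread instead of one per iteration. */
  atomic_fetch_add(&shared_total, local);
  return NULL;
}

/* Hypothetical helper: runs the workers and returns the merged total. */
long run_local_then_merge(void) {
  pthread_t threads[WORKER_COUNT];

  atomic_store(&shared_total, 0);
  for (int i = 0; i < WORKER_COUNT; i++) {
    int rc = pthread_create(&threads[i], NULL, local_then_merge, NULL);
    assert(rc == 0);
    (void)rc;
  }
  for (int i = 0; i < WORKER_COUNT; i++) {
    pthread_join(threads[i], NULL);
  }
  return atomic_load(&shared_total);
}
```

The shared cache line is written once per thread, so coherency traffic no longer scales with the iteration count.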

    Also note that you should expect different results depending on whether tls_var is in the TLS. When it is in the TLS, modifications made by one thread are not visible to the other threads. When it is a shared variable, you have to make sure that no more than one thread accesses it at any given time; otherwise increments can be lost. This is usually achieved with a critical section or a locked (atomic) addition.
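As a rough illustration of the two options just named, the sketch below updates one shared counter inside a mutex-protected critical section and another with an atomic addition, which compilers typically turn into a lock-prefixed instruction on x86-64. All names (`mutex_counter`, `atomic_counter`, `counter_lock`, `run_synchronized_counters`) and the loop sizes are invented for this example, and `<stdatomic.h>` again assumes C11. Both counters come out correct, but every increment still hits a shared cache line, so this fixes correctness and not the slowdown discussed above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

enum { SYNC_WORKERS = 4, SYNC_ITERATIONS = 100000 };

static int mutex_counter; /* protected by counter_lock */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int atomic_counter; /* updated with atomic additions */

static void *sync_worker(void *unused) {
  (void)unused;
  for (int i = 0; i < SYNC_ITERATIONS; i++) {
    /* Critical section: only one thread at a time runs the increment. */
    pthread_mutex_lock(&counter_lock);
    mutex_counter++;
    pthread_mutex_unlock(&counter_lock);

    /* Atomic read-modify-write: no lock object needed. */
    atomic_fetch_add(&atomic_counter, 1);
  }
  return NULL;
}

/* Hypothetical helper: runs the workers and reports both final counts. */
void run_synchronized_counters(int *mutex_result, int *atomic_result) {
  pthread_t threads[SYNC_WORKERS];

  mutex_counter = 0;
  atomic_store(&atomic_counter, 0);
  for (int i = 0; i < SYNC_WORKERS; i++) {
    int rc = pthread_create(&threads[i], NULL, sync_worker, NULL);
    assert(rc == 0);
    (void)rc;
  }
  for (int i = 0; i < SYNC_WORKERS; i++) {
    pthread_join(threads[i], NULL);
  }
  *mutex_result = mutex_counter; /* safe to read: all workers were joined */
  *atomic_result = atomic_load(&atomic_counter);
}
```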
