thehftguy.com Open in urlscan Pro
192.0.78.25 Public Scan

Back to summary

URL:
https://thehftguy.com/2023/11/14/the-linux-kernel-has-been-accidentally-hardcoded-to-a-maximum-of-8-cores-for-nearly-2...
Submission: On November 16 via manual (November 16th 2023, 10:41:24 pm UTC) from US — Scanned from DE

Form analysis
4 forms found in the DOM

POST https://thehftguy.com/wp-comments-post.php

<form action="https://thehftguy.com/wp-comments-post.php" method="post" id="commentform" class="comment-form" novalidate="">
  <div id="comment-form__verbum" class="light"></div>
  <div class="verbum-form-meta"><input type="hidden" name="comment_post_ID" value="16443" id="comment_post_ID">
    <input type="hidden" name="comment_parent" id="comment_parent" value="0">
    <input type="hidden" name="highlander_comment_nonce" id="highlander_comment_nonce" value="839e2c7c38">
  </div>
  <p style="display: none;"><input type="hidden" id="akismet_comment_nonce" name="akismet_comment_nonce" value="60df3bb8c3"></p>
  <p style="display: none !important;"><label>Δ<textarea name="ak_hp_textarea" cols="45" rows="8" maxlength="100"></textarea></label><input type="hidden" id="ak_js_1" name="ak_js" value="1700174478858">
    <script>
      document.getElementById("ak_js_1").setAttribute("value", (new Date()).getTime());
    </script>
  </p>
</form>

POST https://subscribe.wordpress.com

<form action="https://subscribe.wordpress.com" method="post" accept-charset="utf-8" data-blog="108084023" data-post_access_level="everybody" id="subscribe-blog">
  <p>Follow this blog. Get news about the cloud and the latest devops tools.</p>
  <p id="subscribe-email">
    <label id="subscribe-field-label" for="subscribe-field" class="screen-reader-text"> Email Address: </label>
    <input type="email" name="email" style="width: 95%; padding: 1px 10px" placeholder="Email Address" value="" id="subscribe-field" required="">
  </p>
  <p id="subscribe-submit">
    <input type="hidden" name="action" value="subscribe">
    <input type="hidden" name="blog_id" value="108084023">
    <input type="hidden" name="source" value="https://thehftguy.com/2023/11/14/the-linux-kernel-has-been-accidentally-hardcoded-to-a-maximum-of-8-cores-for-nearly-20-years/">
    <input type="hidden" name="sub-type" value="widget">
    <input type="hidden" name="redirect_fragment" value="subscribe-blog">
    <input type="hidden" id="_wpnonce" name="_wpnonce" value="37f106c06c"> <button type="submit" class="wp-block-button__link"> Follow </button>
  </p>
</form>

POST https://subscribe.wordpress.com

<form method="post" action="https://subscribe.wordpress.com" accept-charset="utf-8" style="display: none;">
  <div class="actnbr-follow-count">Join 539 other followers</div>
  <div>
    <input type="email" name="email" placeholder="Enter your email address" class="actnbr-email-field" aria-label="Enter your email address">
  </div>
  <input type="hidden" name="action" value="subscribe">
  <input type="hidden" name="blog_id" value="108084023">
  <input type="hidden" name="source" value="https://thehftguy.com/2023/11/14/the-linux-kernel-has-been-accidentally-hardcoded-to-a-maximum-of-8-cores-for-nearly-20-years/">
  <input type="hidden" name="sub-type" value="actionbar-follow">
  <input type="hidden" id="_wpnonce" name="_wpnonce" value="37f106c06c">
  <div class="actnbr-button-wrap">
    <button type="submit" value="Sign me up"> Sign me up </button>
  </div>
</form>

<form id="jp-carousel-comment-form">
  <label for="jp-carousel-comment-form-comment-field" class="screen-reader-text">Write a Comment...</label>
  <textarea name="comment" class="jp-carousel-comment-form-field jp-carousel-comment-form-textarea" id="jp-carousel-comment-form-comment-field" placeholder="Write a Comment..."></textarea>
  <div id="jp-carousel-comment-form-submit-and-info-wrapper">
    <div id="jp-carousel-comment-form-commenting-as">
      <fieldset>
        <label for="jp-carousel-comment-form-email-field">Email (Required)</label>
        <input type="text" name="email" class="jp-carousel-comment-form-field jp-carousel-comment-form-text-field" id="jp-carousel-comment-form-email-field">
      </fieldset>
      <fieldset>
        <label for="jp-carousel-comment-form-author-field">Name (Required)</label>
        <input type="text" name="author" class="jp-carousel-comment-form-field jp-carousel-comment-form-text-field" id="jp-carousel-comment-form-author-field">
      </fieldset>
      <fieldset>
        <label for="jp-carousel-comment-form-url-field">Website</label>
        <input type="text" name="url" class="jp-carousel-comment-form-field jp-carousel-comment-form-text-field" id="jp-carousel-comment-form-url-field">
      </fieldset>
    </div>
    <input type="submit" name="submit" class="jp-carousel-comment-form-button" id="jp-carousel-comment-form-button-submit" value="Post Comment">
  </div>
</form>

Text Content

Skip to content
Open Menu
* Home
* About Me
* Contact
* Privacy Policy

THE HFT GUY

A DEVELOPER IN LONDON

tech

THE LINUX KERNEL SCHEDULER HAS BEEN ACCIDENTALLY HARDCODED TO A MAXIMUM OF 8
CORES FOR THE PAST 15 YEARS AND NOBODY NOTICED

14 November 202315 November 2023 thehftguy9 Comments
i

1279 Votes

TL;DR This doesn’t mean that the scheduler can’t use more than 8 cores. The
scheduler controls how to allocate tasks to available cores. How to schedule
particular workloads efficiently on available hardware is a complex problem.
There are settings and hardcoded timings to control the behavior of the
scheduler, some vary with the number of cores, some don’t, unfortunately they
don’t work as intended because they were capped to 8 cores. This had a rationale
around 2005-2010 when the latest CPUs were the core 2 duo and core 2 quad on
interactive desktops and “nobody will ever get more than 8 cores”, this doesn’t
hold as well in 2023 when the baseline is 128 cores per CPU on non-interactive
servers.

TL;DR; Yes, this has performance implications especially if you run compute
clusters, no, your computer won’t get 20 times faster. Sorry. Either way,
comments in the kernel, man pages and sysctl documentation are wrong and should
be corrected, they genuinely missed for the last 15 years that there was a
scaling factor at play capped to 8 cores.

A BIT OF HISTORY

I’ve been diving into the Linux kernel scheduler recently.

To give a short brief introduction to scheduling, imagine a single CPU single
core system. The operating system allocates time slices of a few milliseconds to
run applications. If every application can get a few milliseconds now and then,
the system feels interactive, to the user it feels like the computer is running
hundreds of tasks while the computer is only running 1 single task at any time.

Then multi core systems became affordable for consumers in the early 2000s, most
computers got 2 or 4 cores and the operating system could run multiple tasks in
parallel for real. That changed things a bit but not too much, server could
already have multi 2 or 4 physical CPUs in a server.

The number of cores increased over time and we’ve reached a state in early 2020s
where the baseline is AMD servers with 128 cores per CPU.

For the anecdote, historically on windows the period was around 16 milliseconds,
it had a funny side effect where an application doing a sleep(1ms) would resume
around 16 milliseconds later.

LINUX SCHEDULER

Back to Linux, short and simplified version.

On Linux the kernel scheduler works in periods of 5 milliseconds (or 20
milliseconds on multicore systems) to ensure that every application has a chance
to run in the next 5 milliseconds. A thread that takes more than 5 milliseconds
is supposed to be pre-empted.

It’s important for end user latency. In layman terms, when you move you mouse
around AND the mouse is moving on the screen at least every few milliseconds,
the computer feels responsive and it’s great. That’s why the Linux kernel was
hardcoded to 5 milliseconds forever ago, to give a good experience to
interactive desktop users.

It’s intuitive enough that the behavior of the scheduler had to change when CPU
got multiple cores. There is no need to aggressively interrupt tasks all the
time to run something else, when there are multiple CPUs to work with.

It’s a balancing act. You don’t want to reschedule tasks every few milliseconds
to another core, because it’s stopping the task and causing context switches and
breaking caches, making all tasks slower, but you have to keep things
responsive, especially if it’s a desktop with an end user, is it a desktop
though?

The scheduler can be tuned with sysctl settings. There are many settings
available, see this article for an intro but know that all scaling numbers are
wrong since the kernel was accidentally hardcoded to 8 cores
https://dev.to/satorutakeuchi/the-linux-s-sysctl-parameters-about-process-scheduler-1dh5

* sched_latency_ns
* sched_min_granularity_ns (renamed to base_slice in kernel v6)
* wakeup_granularity_ns

COMMITS

This commit added automated scaling of scheduler settings with the number of
cores in 2009.

Privacy Settings

It was accidentally hardcoded to a maximum of 8 cores. Oops. The magic value
comes from an older commit and remained as it was. Please refer to the full diff
on GitHub.

https://github.com/torvalds/linux/commit/acb4a848da821a095ae9e4d8b22ae2d9633ba5cd

An interesting setting is the minimum granularity, this is supposed to allow
tasks to run for a minimum amount of 0.75 milliseconds (or 3 ms in multicore
systems) when the system is overloaded (there are more tasks to run in the
period than there are CPU available).

The min_granularity setting was renamed to base_slice in this commit in v6
kernel.

The comment says it scales with CPU count and the comment is incorrect. I wonder
whether kernel developers are aware of that mistake as they are rewriting the
scheduler!

* Official comments in the code says it’s scaling with log2(1+cores) but it
doesn’t.
* All the comments in the code are incorrect.
* Official documentation and man pages are incorrect.
* Every blog article, stack overflow answer and guide ever published about the
scheduler is incorrect.

https://github.com/torvalds/linux/commit/e4ec3318a17f5dcf11bc23b2d2c1da4c1c5bb507

Below is the function that does the scaling. This is the code where it’s been
hardcoded to a maximum of 8 cores.

The maximum scaling factor is 4 on multi core systems: 1+log2(min(cores, 8)).

The default is to use log scaling, you can adjust kernel settings to use linear
scaling instead but it’s broken too, capped to 8 cores.

https://github.com/torvalds/linux/blame/master/kernel/sched/fair.c#L198

Recent AMD desktop have 32 threads, recent AMD servers have 128 threads per CPU
and often have multiple physical CPUs.

It’s problematic that the kernel was hardcoded to a maximum of 8 cores (scaling
factor of 4). It can’t be good to reschedule hundreds of tasks every few
milliseconds, maybe on a different core, maybe on a different die. It can’t be
good for performance and cache locality.

To conclude, the kernel has been accidentally hardcoded to 8 cores for the past
15 years and nobody noticed.

Oops. ¯\_(ツ)_/¯

Linux kernel scheduler meets 128 cores AMD CPUs

SHARE THIS ARTICLE:

* Click to share on Twitter (Opens in new window)
* Click to share on Reddit (Opens in new window)
* Click to share on Facebook (Opens in new window)
* Click to share on LinkedIn (Opens in new window)
* Click to share on Hacker News (Opens in new window)
* Click to email a link to a friend (Opens in new window)
* More
*

* 1Click to share on Pinterest (Opens in new window)1
* Click to share on Pocket (Opens in new window)
* Click to share on Telegram (Opens in new window)
* Click to share on WhatsApp (Opens in new window)
* Click to share on Tumblr (Opens in new window)
* Click to print (Opens in new window)
*

LIKE THIS:

Like Loading...

WHAT AES CIPHERS TO USE BETWEEN CBC, GCM, CCM, CHACHA-POLY?

TL;DR If you only have 5 seconds to pick only one, go with AES-GCM. Most
systems/libraries do both AES-GCM and ChaCha20-Poly1305 out-of-the-box. AES-GCM
(Galois Counter Mode) The most widely used block cipher worldwide.Mandatory as
of TLS 1.2 (2008) and used by default by most clients.RFC 5288 year 2008
https://tools.ietf.org/html/rfc5288 ChaCha20-Poly1305…

20 April 2020

In "tech"

GOOGLE CLOUD IS 50% CHEAPER THAN AWS

Let's revisit Google and Amazon pricing since the AWS November 2016 Price
Reduction. We'll analyse instance costs, for various workloads and usages. All
prices are given in dollars per month (720 hours) for servers located in Europe
(eu-west-1). Shared CPU Instances Shared CPU instances give only a bit of CPU.…

18 November 2016

In "tech"

MAJOR BUG IN GLIBC IS KILLING APPLICATIONS WITH A MEMORY LIMIT

This is the story of a debugging case with a happy ending. TL;DR This is a bug
in the glibc malloc(). It mainly affects 64 bits multi-threading applications.
With a special mention to Java applications because the JVM seems to always
trigger the worst case. The write up is quite…

21 May 2020

In "tech"

bug, cpu, linux kernel, scheduler

POST NAVIGATION

Previous Post
Why are there no antitrust claims vs GitHub Copilot, when there is a precedent?

9 THOUGHTS ON “THE LINUX KERNEL SCHEDULER HAS BEEN ACCIDENTALLY HARDCODED TO A
MAXIMUM OF 8 CORES FOR THE PAST 15 YEARS AND NOBODY NOTICED”

1. Colin King says:
14 November 2023 at 12:50

I’ve filed a bug upstream to track this
https://bugzilla.kernel.org/show_bug.cgi?id=218144

LikeLike

2. Morten Fjord Christensen says:
15 November 2023 at 06:58

The comments in the kernel are misleading or straight up wrong, but the code
isn’t. The time slices are limited to scale with up to 8 cores, to not get
huge time slices for tons of cores, but the scheduler itself can handle as
many cores as the system has.

LikeLike

3. technicalparadox says:
15 November 2023 at 07:31

The 8 core limit only applies to scaling of the time slices.

LikeLike

4. Paul Maidment says:
15 November 2023 at 08:03

Quite a find. Have you opened a bug, got a link to the bug?

LikeLike

5. Kris says:
15 November 2023 at 11:15

Your headline does not correctly describe the behavior you describe. Which
is a shame because it may be an important finding.

But then “Linux Scheduler Optimized For 8 Cores” doesn’t sell as well.

LikeLike

6. Zokkiretru says:
15 November 2023 at 15:15

Whoa, this is surprising!
Have you tried removing the limit and benchmarking it? I am curious how much
performance we are leaving on the table from that bug, but I don’t have any
high core count machines.

LikeLike

7. bmeneg says:
15 November 2023 at 16:03

This code is related only to the granularity time, increasing the number of
CPUs would only allow tasks to run longer in a certain core, which is not
the result you might want, since other tasks also want to run in that same
core and would be stalled for more time. As you well pointed, is an act of
balancing things, and having 8 as maximum number of CPUs for that specific
calculation seems to still be a fine choice. Maybe a new benchmark would
suggest another ceiling? Maybe. But that’s definitely not related to how
many cores the scheduler see/use/consider/…

Check what Peter Zijlstra replied to this exact subject a few years ago:
https://lore.kernel.org/all/20211102160402.GX174703@worktop.programming.kicks-ass.net/

LikeLike

8. auri0 says:
15 November 2023 at 16:17

Very interesting article! Thanks for sharing. This may be unrelated to your
discovery, however on 23rd June 2021 I noticed that Linux (Ubuntu)
multiprocessing times per number of CPUs enabled posed an anomaly … whereas
Windows acted as (I) expected. My (rather amateur) finding is posted on
github here: https://github.com/skyfielders/python-skyfield/issues/612

My complaint about Linux is expressed near the following quoted text:
“Windows processing for a Nautical Almanac would then consume a maximum of 2
x 9 = 18 seconds. True! Windows processing of Event Time Tables would then
consume a maximum of 2 x 16.5 seconds. True! But if you look at the Ubuntu
times, this is NOT the case… which implies that something else is going on…
interrupting the supposedly parallel processes. Very strange.”

LikeLike

9. THRWY says:
15 November 2023 at 17:28

As discussed on the Hacker News thread for this article[0], the title and
conclusion of this article are incorrect and seem to be based on a
misunderstanding.

What is capped to 8 is a scaling factor for the interval at which the
scheduler switches tasks on any given core. Not the amount of cores the
system can actually work with.

As correctly stated in this article, the fewer cores you have the quicker
those cores need to switch between tasks to maintain fluid, interactive
“feel” for the user and to maintain performance across multiple running
programs. However, as the core count increases from 8 to 16 to 128 and more,
as some modern servers may have, you actually don’t want to continue
increasing that interval factor – you don’t want individual cores bound to a
single task for whole seconds! Especially when those machines are under high
load, that would probably lead to reduced performance.

In fact, the comment currently visible in the last screenshot used in this
article states “but the relationship is not linear, so pick a second-best
guess”.

[0] https://news.ycombinator.com/item?id=38260935

LikeLiked by 1 person

thehftguy.com Open in urlscan Pro 192.0.78.25 Public Scan

Form analysis 4 forms found in the DOM

POST https://thehftguy.com/wp-comments-post.php

POST https://subscribe.wordpress.com

POST https://subscribe.wordpress.com

Text Content

thehftguy.com Open in urlscan Pro
192.0.78.25 Public Scan

Form analysis
4 forms found in the DOM