I assume by performance you mean CPU usage per io request. Each io call should require a switch to the kernel and back. When you do blocking io the switch back is delayed(switch to other threads while waiting), but not more taxing. How could it be possible for there to be a difference?
Because the kernel doesn’t like you spawning 100k threads. Your RAM doesn’t, either. Even all the stacks aside, the kernel needs to record everything in data structures which now are bigger and need longer to traverse. Each thread is a process which could e.g. be sent a signal, requiring keeping stuff around that rust definitely doesn’t keep around (async functions get compiled to tight state machines).
Specifically with io_uring: You can fire off quite a number of requests, not incurring a context switch (kernel and process share a ring buffer) and later on check on the completion status quite a number, also without having to context switch. If you’re (exceedingly) lucky no io_uring call ever cause a context switch as the kernel will work on that queue on another cpu. The whole thing is memory, not CPU, bound.
Anyhow, your mode of inquiry is fundamentally wrong in the first place: It doesn’t matter whether you can explain why exactly async is faster (I probably did a horrible job and got stuff wrong), what matters is that benchmarks blow blocking io out of the water. That’s the ground truth. As programmers, as a first approximation, or ideas and models of how things work are generally completely wrong.
Because the kernel doesn’t like you spawning 100k threads.
Why do you say this?
Your RAM doesn’t, either
Not if your stacks per thread are small.
Even all the stacks aside, the kernel needs to record everything in data structures which now are bigger and need longer to traverse.
These data structures must exist either in userland or the kernel. Moving them to the kernel won’t help anything. Also, many of these data structures scale at log(n). Splitting have the elements to userland and keeping the other half gives you two structures with log(n/2) so 2log(n/2) = log(n^2/4). Clearly that’s worse.
Each thread is a process which could e.g. be sent a signal, requiring keeping stuff around that rust definitely doesn’t keep around (async functions get compiled to tight state machines).
If signals were the reason async worked better, then the correct solution is to enable threads that opt-out of signals. Anything that slows down threads that isn’t present in an async design should be opt-out-able. The state-machines that async compiles to, do not appear inherently superior to multiple less stateful threads managed by a fast scheduler.
Specifically with io_uring: You can fire off quite a number of requests, not incurring a context switch …
As described here you would still need to do a switch to kernel mode and back for the syscalls. The extra work required from assuming processes are hostile to each other should be easy to avoid among threads known to have a common process as they are obviously not hostile to each other and share memory space anyway.
The synchronization required to handle multiple tasks should be the same regardless if they are being run on the same thread by a user land scheduler or if they are running on multiple threads with an os scheduler.
Anyhow, your mode of inquiry is fundamentally wrong in the first place: …
I’m not interested in saying that async is the best because it appears to work well currently. That’s not the right way to decide the future of how to do things. That’s just a statement of how things are. I agree, if your only goal is get the fastest thing now with no critical thought, then it does appear that async is faster. I am unconvinced it must fundamentally be the case.
Page size is 4k it doesn’t get smaller. The kernel can’t give out memory in more fine-grained amounts, at least not without requiring a syscall on every access which would be prohibitively expensive.
Anything that slows down threads that isn’t present in an async design should be opt-out-able.
That’s what async does. It opts out of all the things, including having to do context switches when doing IO.
As described here you would still need to do a switch to kernel mode and back for the syscalls.
No, you don’t: You can poll the data structure and the kernel can poll the data structure. No syscalls required. Kernel can do it on one core, the application on another, so in the extreme you don’t even need to invoke the scheduler.
I am unconvinced it must fundamentally be the case.
You can e.g. have a look at whether you can change the hardware to allow for arbitrarily small page sizes. The reaction of hardware designers will be first “are you crazy”, then, upon explaining your issue, they’ll tell you “well then just use async what’s the problem”.
You are not logged in. However you can subscribe from another Fediverse account, for example Lemmy or Mastodon. To do this, paste the following into the search field of your instance: !programmerhumor@lemmy.ml
Post funny things about programming here! (Or just rant about your favourite programming language.)
Rules:
Posts must be relevant to programming, programmers, or computer science.
No NSFW content.
Jokes must be in good taste. No hate speech, bigotry, etc.
I assume by performance you mean CPU usage per io request. Each io call should require a switch to the kernel and back. When you do blocking io the switch back is delayed(switch to other threads while waiting), but not more taxing. How could it be possible for there to be a difference?
Because the kernel doesn’t like you spawning 100k threads. Your RAM doesn’t, either. Even all the stacks aside, the kernel needs to record everything in data structures which now are bigger and need longer to traverse. Each thread is a process which could e.g. be sent a signal, requiring keeping stuff around that rust definitely doesn’t keep around (async functions get compiled to tight state machines).
Specifically with io_uring: You can fire off quite a number of requests, not incurring a context switch (kernel and process share a ring buffer) and later on check on the completion status quite a number, also without having to context switch. If you’re (exceedingly) lucky no io_uring call ever cause a context switch as the kernel will work on that queue on another cpu. The whole thing is memory, not CPU, bound.
Anyhow, your mode of inquiry is fundamentally wrong in the first place: It doesn’t matter whether you can explain why exactly async is faster (I probably did a horrible job and got stuff wrong), what matters is that benchmarks blow blocking io out of the water. That’s the ground truth. As programmers, as a first approximation, or ideas and models of how things work are generally completely wrong.
Why do you say this?
Not if your stacks per thread are small.
These data structures must exist either in userland or the kernel. Moving them to the kernel won’t help anything. Also, many of these data structures scale at log(n). Splitting have the elements to userland and keeping the other half gives you two structures with log(n/2) so 2log(n/2) = log(n^2/4). Clearly that’s worse.
If signals were the reason async worked better, then the correct solution is to enable threads that opt-out of signals. Anything that slows down threads that isn’t present in an async design should be opt-out-able. The state-machines that async compiles to, do not appear inherently superior to multiple less stateful threads managed by a fast scheduler.
As described here you would still need to do a switch to kernel mode and back for the syscalls. The extra work required from assuming processes are hostile to each other should be easy to avoid among threads known to have a common process as they are obviously not hostile to each other and share memory space anyway. The synchronization required to handle multiple tasks should be the same regardless if they are being run on the same thread by a user land scheduler or if they are running on multiple threads with an os scheduler.
I’m not interested in saying that async is the best because it appears to work well currently. That’s not the right way to decide the future of how to do things. That’s just a statement of how things are. I agree, if your only goal is get the fastest thing now with no critical thought, then it does appear that async is faster. I am unconvinced it must fundamentally be the case.
Have you tried?
Page size is 4k it doesn’t get smaller. The kernel can’t give out memory in more fine-grained amounts, at least not without requiring a syscall on every access which would be prohibitively expensive.
That’s what async does. It opts out of all the things, including having to do context switches when doing IO.
No, you don’t: You can poll the data structure and the kernel can poll the data structure. No syscalls required. Kernel can do it on one core, the application on another, so in the extreme you don’t even need to invoke the scheduler.
You can e.g. have a look at whether you can change the hardware to allow for arbitrarily small page sizes. The reaction of hardware designers will be first “are you crazy”, then, upon explaining your issue, they’ll tell you “well then just use async what’s the problem”.