Hi,
I tried searching but am coming up short. I have a quad-core PC running Win XP and am looking for a solution where I can selectively assign jobs to each processor, i.e. I want to pick one of my processors and tell it to work exclusively on some specific data crunching. I tried to find a library that will allow me to do so, but with no luck.
Does anyone know of a library that will let me assign jobs to a specific processor? Any other approaches and solutions are greatly appreciated.
Mutant_Fruit
December 9th, 2008, 04:17 AM
The operating system will do a much better job of this than you can. Why not let the OS decide?
cblind
December 9th, 2008, 04:59 AM
The operating system will do a much better job of this than you can. Why not let the OS decide?
we tried but are not getting the performance we expected (hoped for) ... the idea now is to see if we can dedicate a core or two to a specific job.... if I can only figure out how to do it in C#
boudino
December 9th, 2008, 07:04 AM
As far as I know, there are several projects developing C# with native support for parallel computation, but they are not in the production phase yet.
Jensecj
December 9th, 2008, 10:13 AM
C# 4.0 and .NET 4.0 will ship with functions supporting multiple cores.
Until then, take a look here (http://blogs.msdn.com/pfxteam/) and here (http://www.codeproject.com/KB/cs/aforge_parallel.aspx).
MadHatter
December 9th, 2008, 11:06 AM
we tried but are not getting the performance we expected (hoped for) ... the idea now is to see if we can dedicate a core or two to a specific job.... if I can only figure out how to do it in C#
I would go out on a limb and say that your multithreaded implementation is the cause of the performance problem. Dedicating work to a single processor is rarely going to be as efficient as letting the operating system's thread scheduler deal with your program. Doing multithreaded programming badly is worse, IMHO, than doing single-threaded programming well.
If scalability is your ultimate goal, dictating which CPU does what is going to overcomplicate your implementation and cripple its ability to scale, as you'll need to implement what's already implemented in the OS (and I'd venture to guess that the authors of the Windows kernel know a thing or two more about concurrent processing than you might). Your app can run fine with you doing your own scheduling for 2 CPUs or 4 CPUs, but what if you throw your app on a 16 CPU system? Does your app make use of a max of 4 CPUs and leave the other 12 CPUs unattended?
Having said that, you can spawn processes (small console exes) that do the work. Doing so allows you to specify processor affinity and thread priority.
// requires: using System; using System.Diagnostics;
ProcessStartInfo psi = new ProcessStartInfo("notepad.exe");
Process p = new Process();
p.StartInfo = psi;
p.Start();
// Affinity is a bitmask with one bit per processor:
// 0x1 = cpu 1, 0x2 = cpu 2, 0x4 = cpu 3, 0x8 = cpu 4
int cpu = 0x8;
p.ProcessorAffinity = new IntPtr(cpu);
or you can simply set it in the worker exe:
using System;
using System.Diagnostics;
namespace CpuSpecificTask {
    class Program {
        static void Main(string[] args) {
            Process.GetCurrentProcess().ProcessorAffinity = new IntPtr(0x08);
            Console.ReadLine();
        }
    }
}
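The hex masks above are easy to get wrong by hand, and a hard-coded 0x8 fails on a machine with fewer than four processors. A minimal sketch of a helper that builds the mask from zero-based core indices and checks them against the processor count follows. The AffinityMask class and its FromCores method are hypothetical names made up for this example, not part of the .NET Framework; only Process.ProcessorAffinity and Environment.ProcessorCount are real APIs. It also assumes the process is allowed to run on core 0.

```csharp
using System;
using System.Diagnostics;

static class AffinityMask
{
    // Hypothetical helper: sets one bit per zero-based core index.
    // Core 0 -> 0x1, core 1 -> 0x2, core 2 -> 0x4, core 3 -> 0x8.
    public static long FromCores(int processorCount, params int[] cores)
    {
        long mask = 0;
        foreach (int core in cores)
        {
            if (core < 0 || core >= processorCount)
                throw new ArgumentOutOfRangeException(
                    "cores", "core index " + core + " is not present on this machine");
            mask |= 1L << core;
        }
        return mask;
    }
}

static class Program
{
    static void Main()
    {
        // Restrict the current process to the first logical processor only.
        long mask = AffinityMask.FromCores(Environment.ProcessorCount, 0);
        Process.GetCurrentProcess().ProcessorAffinity = new IntPtr(mask);
        Console.WriteLine("Affinity mask: 0x{0:X}", mask);
    }
}
```

Passing the processor count in explicitly keeps the mask calculation testable without touching the running process.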
cblind
December 9th, 2008, 01:01 PM
Hi,
thanks all!
Boudino - would you mind sending me references (links) to the projects?
Jensecj - thanks ! that article (aforge) is very informative, as well as video links
MadHatter - all good points and observations. I am not ruling out implementation as an issue although best practices were followed. Currently we let OS deal with threads but expected performance is not there. Perhaps we are asking too much from it....
Mutant_Fruit
December 9th, 2008, 01:25 PM
Forcing your application to only run on a specific core can give you significantly worse performance. Suppose another developer has decided to force his app to run thread0 on core0 and thread1 on core1, and you do exactly the same. Now imagine you're running on a quad-core. You've just halved the performance of your application for no reason other than the mistaken belief that forcing a thread to operate on a specific core is a good thing.
There are limited scenarios where setting processor affinity may be helpful, but those scenarios don't exist on a standard end-user computer system.
If you're not getting the performance you're expecting, it's much more likely that your expectations are wrong, or your threading implementation is sub-optimal.
cjard
December 9th, 2008, 08:38 PM
best practices were followed
Care to list them?
Programming by numbers?
Currently we let OS deal with threads but expected performance is not there. Perhaps we are asking too much from it....
Or perhaps you don't know how to ask for it, or your code isn't as good as you think it is. The best solution would be to let the guys here have a look at it.. i.e. let them see the actual problem, not the problem with your envisaged solution. All too often we see this:
Code is broken
Solution is imagined
Solution is broken
Get help for fixing broken solution
It should be:
Code is broken
Get help for fixing broken code
TheCPUWizard
December 9th, 2008, 09:05 PM
Just as an FYI, I write a number of high-performance (i.e. real-time analysis of laboratory instrumentation) applications in C# that are massively parallelized. 95% of the time I do not have to manually deal with assignments to specific cores (there is one video application, soon to be ported to a dedicated controller, that is the exception).
Usage of the Parallel Extensions library, along with a synthesized "Futures and Promises" architecture, allows me to utilize all of the execution paths (the production lab has a sweet 16 core system) running on both my Q6600 and (brand new) i7 development machines.
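For readers who have not seen this style, here is a minimal sketch of data-parallel work that leaves core assignment to the runtime and the OS. It uses Parallel.For from System.Threading.Tasks, the form in which the Parallel Extensions shipped in .NET 4.0. The FrameOps class and the "invert a grayscale frame" workload are hypothetical stand-ins for illustration, not the poster's architecture.

```csharp
using System;
using System.Threading.Tasks;

static class FrameOps
{
    // Hypothetical per-frame work: invert an 8-bit grayscale image.
    // Each row is independent, so rows can be processed in parallel
    // on however many cores the scheduler decides to use.
    public static void Invert(byte[] pixels, int width, int height)
    {
        Parallel.For(0, height, y =>
        {
            int start = y * width;
            for (int x = 0; x < width; x++)
                pixels[start + x] = (byte)(255 - pixels[start + x]);
        });
    }
}

static class Program
{
    static void Main()
    {
        byte[] frame = new byte[640 * 480]; // all-black test frame
        FrameOps.Invert(frame, 640, 480);
        Console.WriteLine("First pixel after invert: {0}", frame[0]);
    }
}
```

Nothing in this code names a core, so the same binary uses 4 cores on a quad-core box and 16 on a 16 core system.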
cblind
December 10th, 2008, 02:40 AM
All valid points and questions
Care to list them?
Programming by numbers?
I am not quite sure I understand reference to "programming by the numbers".... I realize that "best practices" statement is a bit general / umbrella statement. By that I meant that code documentation is done well, care was taken not to have race conditions, modular development, etc.
I would love if I could post the code for peer review and help, but I can't for two main reasons:
- it's huugeeee; it wouldn't be appropriate for a typical forum code snippet (if I was sure where the problem in the code is, I would post only that part, but we don't know :) :( )
- I don't think the company bosses would appreciate the company's secret sauce in the public domain
There are 3 people working on the code - so it's manageable from standpoint of review, and who is doing what and where in the code (reference to an earlier post)
But there is performance issue for some reason....hence my original post.
Perhaps to clarify things - we have big app which is doing few things, including (desired) real time video processing - that is part of the app that I would like to dedicate core or two for. Currently we are able to get about 7 frames per second (and that is pushing it). I would like it to be 2x or 3x better.
Ideally I would like us to move to RT OS / embedded target, but it's not an option for now. So we are left with trying to squeeze / optimize things on PC box....
Mutant_Fruit
December 10th, 2008, 03:59 AM
Unless you own 100% of the processes on the machine, you should not force a specific process to a specific core. You won't increase performance and you run the risk that some other developer has had the same (bad) idea and then both of your programs will suffer a huge performance penalty as they fight for the same core despite having several completely free cores.
If you want faster encoding, the physical encoder itself must be multithreaded. Other than buffering input and output, there is absolutely nothing else you can do to speed up the encoder (other than reducing the quality settings). Once again, if you aren't getting the performance you're expecting, either your expectations are wrong, your code doesn't benefit from multiple cores, or you've just written it badly. Forcing a process to a core will not help. It definitely won't give you the 2x or 3x boost you appear to be looking for.
ZOverLord
December 10th, 2008, 05:58 AM
Actually, this is not true.
For many years methods have been used to manage multi-processor based systems very well actually, using factors.
In some cases, you can gain the power of 1 CPU for every 4 managed on very poorly managed multi-processor based systems. This also applies to entire networks, where systems have the ability to launch processes on other systems in the network.
More here: http://www.overlord.com/Balancer_Brochure.html
While the above may be for other multi-processor based systems, the same concepts apply to PC-based multi-processor systems as well as any multi-processor based system.
I designed and created the software above ("After Black-Monday for the NYSE") and it is used by many companies worldwide, including branches of the U.S. Military. It is a proven fact that managed multi-processor systems perform better than unmanaged ones.
Mutant_Fruit
December 10th, 2008, 08:35 AM
it is a proven fact that managed multi-processor systems perform better than unmanaged ones.
Sure, that can be true for systems where you own and manage 100% of the active threads. However, on a standard server or desktop machine, you own a few threads out of hundreds. Therefore by specifying exactly which core you should run your process on you can cripple your performance.
So don't do this on a regular desktop/server.
ZOverLord
December 10th, 2008, 10:10 AM
This has nothing to do with owning or managing 100% of the active threads, again, your statement is not true.
A vast majority of the companies that use these automatic load balancing techniques, have lots of 3rd party software, and never have seen or will see the source code for that software, and are not in 100% control of what processes or threads will be launched on their system by other systems in the network, let alone what processes will launch at specific times, since most of the load is user driven which makes it very dynamic.
This statement is completely false:
"You won't increase performance and you run the risk that some other developer has had the same (bad) idea and then both of your programs will suffer a huge performance penalty as they fight for the same core despite having several completely free cores."
Having properly weighted factors used to determine process/thread placement in cores for the environment you are in gives you much better odds of increasing your performance, if even for a short period of time, than blind luck does.
It also would make it impossible to "fight for the same core despite having several completely free cores" in all cases, again, worst case being better off in the short term than long term.
Having this kind of load balancing done on a network/system basis would be the preferred choice. However, this is a valid secondary choice as well for a process that launches many processes, as was stated here, when this network/system load-balancing ability is not already present where the process in question is going to be running.
This is not speculation on my part, this comes from over 20 years of experience of designing, selling, testing and having customers doing this on many different systems with many different combinations of software and workloads.
This is Fact not Fiction ;-) Had you looked at the link I provided in my prior post you would have understood that, by seeing the testimonials of corporate customers using these techniques, the processing advantages they have gained, and the hardware costs they have saved.
Mutant_Fruit
December 10th, 2008, 12:31 PM
This statement is completely false:
"You won't increase performance and you run the risk that some other developer has had the same (bad) idea and then both of your programs will suffer a huge performance penalty as they fight for the same core despite having several completely free cores."
Let me make my position clearer:
On a standard desktop (or standard server), if your app1thread1 is forced to execute on core0 and app2thread1 is forced to execute on core0, then both of those threads will execute on core0. If you have a 10000 core box, those two threads will still execute on core0.
Therefore both applications will have 1/2 the performance they should have as compared to letting the OS schedule the two threads onto core0 and core1 respectively.
So yes, my statement is correct. Sure, you can employ third party software which will change the processor for active threads/processes, but then you're just substituting the OS scheduler for a third party scheduler. Sure, this can work. The third party scheduler could easily be better. But unless you employ this software, you *should not* force threads to execute on a specific core.
EDIT:
If you have third party software which manages which thread operates on which core, then trying to set affinity in code is a waste of time as the third party scheduler will just override that. So you've just weakened the argument for trying to set the affinity through code.
Having properly weighted factors used to determine process/thread placement in cores for the environment you are in gives you much better odds of increasing your performance, if even for a short period of time, than blind luck does.
It also would make it impossible to "fight for the same core despite having several completely free cores" in all cases, again, worst case being better off in the short term than long term.
Unfortunately these properly weighted factors will not be available in usercode.
TheCPUWizard
December 10th, 2008, 12:40 PM
I think the two points of view have finally converged (I have been watching for a while).
IMHO:
1) There are definitely viable alternatives to the "built-in" Windows scheduling/allocation of resources. These take into account the entire environment on the system, and are often very sophisticated. For many systems these may be much better.
2) Attempting to "coerce" things from inside an application without having a view of the total system is likely to have a (potentially severe) negative impact when used on "normal" desktops or servers.
Both of these mirror my own personal/professional experiences.
cblind
December 10th, 2008, 02:33 PM
Thanks for valuable insights !
To further clarify our particular predicament: we have a PC box running Win XP and our custom application. In our case we don't have an issue of multiple developers "overbooking" one core while others are "idle" - it's less than a handful of people working on the app, code is reviewed, and overall development is tightly managed... What we want is to free a core (or two) to just process video in real time (video processing is a module inside of our app) - the 'hope' is that the video thread will not have to compete for resources but rather have dedicated space to do its thing.
TheCPUWizard
December 10th, 2008, 02:38 PM
Thanks for valuable insights !
To further clarify our particular predicament: we have a PC box running Win XP and our custom application. In our case we don't have an issue of multiple developers "overbooking" one core while others are "idle" - it's less than a handful of people working on the app, code is reviewed, and overall development is tightly managed... What we want is to free a core (or two) to just process video in real time (video processing is a module inside of our app) - the 'hope' is that the video thread will not have to compete for resources but rather have dedicated space to do its thing.
You are missing the point. To be effective, the code MUST be aware of every process running on the system. On my current system there are 53 processes running WHEN THE SYSTEM IS IDLE (as counted by TaskMgr), and many of these are multi-threaded.
To dedicate a processor to YOUR work, you would have to impact the processing of all of these processes/threads. Unless you coordinate ALL of the threads of ALL of the processes, you are NOT going to achieve your goal.
Mutant_Fruit
December 10th, 2008, 02:57 PM
In our case we don't have an issue of multiple developers "overbooking" one core while others are "idle"
Just so long as you're sure that no *other* developer working on a completely separate project, (maybe from a completely separate company) decides to set thread affinity on his application so he can do complex processing on a specific core... which happens to be the one you choose too.
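Go benchmark the difference. The Process class exposes a ProcessorAffinity property for the whole process, and each ProcessThread in Process.Threads has its own ProcessorAffinity property for per-thread affinity. Benchmark your app with no affinity set on the video thread and with affinity set. The difference should be negligible.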
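A minimal sketch of such a benchmark, under stated assumptions: it pins the whole process through Process.ProcessorAffinity rather than a single thread, which is the simplest thing to try first, and the Work method is a made-up CPU-bound stand-in for the real video step. AffinityBenchmark is a hypothetical name. It also assumes the process is permitted to run on the first logical processor.

```csharp
using System;
using System.Diagnostics;

static class AffinityBenchmark
{
    // Stand-in CPU-bound workload (hypothetical; substitute the real video step).
    public static long Work(int iterations)
    {
        long acc = 0;
        for (int i = 0; i < iterations; i++)
            acc += (i * 31L) % 7;
        return acc;
    }

    // Returns the wall-clock time taken by one call to action.
    public static TimeSpan Time(Action action)
    {
        Stopwatch sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        return sw.Elapsed;
    }
}

static class Program
{
    static void Main()
    {
        const int N = 50000000;
        long sink = 0; // keeps the result alive so the loop is not discarded
        Process self = Process.GetCurrentProcess();
        IntPtr original = self.ProcessorAffinity;
        try
        {
            TimeSpan free = AffinityBenchmark.Time(() => sink += AffinityBenchmark.Work(N));
            self.ProcessorAffinity = new IntPtr(1); // first logical processor only
            TimeSpan pinned = AffinityBenchmark.Time(() => sink += AffinityBenchmark.Work(N));
            Console.WriteLine("No affinity: {0} ms", free.TotalMilliseconds);
            Console.WriteLine("Pinned:      {0} ms", pinned.TotalMilliseconds);
        }
        finally
        {
            self.ProcessorAffinity = original; // always restore the original mask
        }
        Console.WriteLine("Checksum: {0}", sink);
    }
}
```

A single run proves little. Repeat it several times, and with the machine's normal background load, before drawing any conclusion.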
ZOverLord
December 10th, 2008, 05:18 PM
Let me make my position clearer:
On a standard desktop (or standard server), if your app1thread1 is forced to execute on core0 and app2thread1 is forced to execute on core0, then both of those threads will execute on core0. If you have a 10000 core box, those two threads will still execute on core0.
Therefore both applications will have 1/2 the performance they should have as compared to letting the OS schedule the two threads onto core0 and core1 respectively.
So yes, my statement is correct. Sure, you can employ third party software which will change the processor for active threads/processes, but then you're just substituting the OS scheduler for a third party scheduler. Sure, this can work. The third party scheduler could easily be better. But unless you employ this software, you *should not* force threads to execute on a specific core.
EDIT:
If you have third party software which manages which thread operates on which core, then trying to set affinity in code is a waste of time as the third party scheduler will just override that. So you've just weakened the argument for trying to set the affinity through code.
Unfortunately these properly weighted factors will not be available in usercode.
Again, you keep making the auto-balancing code appear stupid and brain dead.
If you add balancing code to a process that is going to launch many other processes, you don't force things to go to the worst core so that you run 50 percent worse than without your balancing code.
Again, the best method would be to have a process do ALL core management, but as a secondary solution adding balancing logic to a program that is responsible to launch many other processes over its runtime life is a good idea, especially when you are trying to exclude some cores so that they won't have the workload of the processes you are launching added to them.
It's common sense, you can't "Dumb That Up!" You also can't make claims like "Let Windows manage it, because you will screw it up somehow" Windows can't auto-magically manage this, because there is core exclusion required, get it?
The goal is to use factors to choose the "Best" core at that time, that's the ONLY goal, ("The program in question here is NOT a system resource management application") which could and will become worse overtime ("Because the program in question is not globally managing all processes"), but at that moment is the "Best" which may include the exclusion of some cores as well.
Also, any and all information required to create these properly weighted factors is available via user code.
What system information are you claiming is not available via user code to create these weighted factors, exactly?
ZOverLord
December 10th, 2008, 05:36 PM
You are missing the point. To be effective, the code MUST be aware of every process running on the system. On my current system there are 53 processes running WHEN THE SYSTEM IS IDLE (as counted by TaskMgr), and many of these are multi-threaded.
To dedicate a processor to YOUR work, you would have to impact the processing of all of these processes/threads. Unless you coordinate ALL of the threads of ALL of the processes, you are NOT going to achieve your goal.
The code in question does "Not" need to be aware of "every process running on the system". It could be, but it is "Not" required in order to select the best core and exclude others when it does this.
If the goal is to limit impact on specific cores from a process that will launch many other processes during its lifetime, then you CAN achieve your goal, because those specific cores will now have NONE of the processes you launched running on them. They may have other processes running on them, but none would be yours. So you would in fact achieve your goal. The goal for this application is not system-wide resource management, but.... it does include leaving some cores NOT running any of its created processes.
Here is a perfect example:
Say you have 8 cores and wish to dedicate 2 to development and leave the other 6 to your user base. You have some development applications that create many processes over time and you want to add balancing code to them.
You have no need to have knowledge of every process running in every core, or even the 2 cores you wish to use for development applications. All you need are;
1. What cores are these processes allowed to run in.
2. Using factors created from "System Level" information not down to the level of process information select the best core of the 2 allowed.
One would be hard pressed to claim that the user base does not now have better response time. Of course the development team could be screaming bloody murder, but even in the worst case, you could then make the available cores 3, not 2.
The ability to isolate and exclude cores here, is also being missed here. "built-in" Windows resource management does not deal with this issue, auto-magically.
There also is no need to "Micro-Manage" this balancing down to a process level, that kind of detail is in most cases Over-Kill.
Factors and weighting them properly are important as well, simple things which most programmers don't know can hurt you.
One example: some cores/CPUs may be using "Mutual Exclusion" for things like memory management more than others, so they will never reach a high CPU busy rate, yet they are the worst cores to choose because they are stalled so much. You can adjust the weight of this factor so that, when using factors to select cores, these cores are more likely to be excluded.
Factors allow you to tune the balancing for your environment and the workload it is exposed to.
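The two inputs listed above (which cores are allowed, and system-level factors per core) can be sketched as a weighted score. This is only an illustration of the idea, not the commercial product's algorithm: CoreSelector, the "busy" and "stalled" factors, and the weights are all hypothetical, and gathering real values (for example from performance counters) is left out.

```csharp
using System;

static class CoreSelector
{
    // Returns the allowed core with the lowest weighted score.
    // busy[i] and stalled[i] are fractions in [0, 1] for core i.
    // Both factors and both weights are hypothetical illustrations.
    public static int PickBest(int[] allowed, double[] busy, double[] stalled,
                               double busyWeight, double stallWeight)
    {
        if (allowed.Length == 0)
            throw new ArgumentException("at least one allowed core is required");
        int best = allowed[0];
        double bestScore = double.MaxValue;
        foreach (int core in allowed)
        {
            double score = busyWeight * busy[core] + stallWeight * stalled[core];
            if (score < bestScore)
            {
                bestScore = score;
                best = core;
            }
        }
        return best;
    }
}

static class Program
{
    static void Main()
    {
        // 8 cores; only cores 6 and 7 are allowed for development work.
        double[] busy    = { 0, 0, 0, 0, 0, 0, 0.2, 0.5 };
        double[] stalled = { 0, 0, 0, 0, 0, 0, 0.6, 0.0 };
        int chosen = CoreSelector.PickBest(new[] { 6, 7 }, busy, stalled, 1.0, 1.0);
        Console.WriteLine("Launch next process on core {0}", chosen);
    }
}
```

With equal weights, core 6 looks less busy but scores worse once its stall time is counted, so core 7 is picked. Setting the stall weight to zero flips the choice, which is the tuning knob the post describes.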
Not to be insulting, but....so far I have seen one person here, who has used "libraries" to provide some level of this "Once" in their entire lifetime. This is the only "Real-Life" experience I have seen in this thread so far of using your own balancing methods for process launch.
I have designed and coded a commercial product that does this today for many companies worldwide and has been doing this for over 20 years now, with systems that have 2-16 cores per node and can have over 2,000 nodes per network. That's my real-life experience.
I have seen NO supporting evidence of any kind, that shows that in this case, adding this balancing code would not be productive and a benefit to this system in question, nor have I seen anyone explain how to use "built-in" Windows resource management to exclude cores, auto-magically, without code for this case. What I have seen is dire warnings on how you will "hurt" yourself in some way, yet no detail is provided other than statements like "You will be 50 percent slower".
TheCPUWizard
December 10th, 2008, 05:57 PM
ZOverLord
Your last reply directly contradicts my experiences over the past decade (including using high-end machines such as IBM's 64 (Xeon) processor servers).
I have personally (professionally) experienced many parasitic conditions where only a single "application" (implemented as multiple processes that were each multi-threaded) was running on the system, but SYSTEM processes were left to the operating system's "whim" (this has been true on Windows, Solaris and Unix based systems).
While the types of management you are promoting can appear to yield "better" performance on a statistical basis, the real test comes down to what is the actual WORST case performance that can be induced.
For example, on a completely "unmanaged" (no affinity control) basis, the timing for an operation may have a normal distribution curve over the 3-5 second range (centered at 4 seconds).
When affinity is applied, 99.9% of the sampled measurements may be between 1-3 seconds (centered around 2 seconds), but with a 0.1% measurement rate of 10 seconds.
For a true real-time system, this means that the "managed" system did NOT double the performance (from 4 seconds, to 2 seconds), it actually slowed it down by a factor of 2 (from a worst case of 5 seconds to a worst case of 10 seconds).
Of course these numbers are simply for illustrative purposes (and easy calculations ;)) but the situation is quite real.
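The arithmetic behind that comparison can be sketched in a few lines. The sample values below are the illustrative numbers from this post, not measurements, and LatencyStats is a hypothetical helper name.

```csharp
using System;
using System.Linq;

static class LatencyStats
{
    public static double Mean(double[] seconds) { return seconds.Average(); }
    public static double Worst(double[] seconds) { return seconds.Max(); }
}

static class Program
{
    static void Main()
    {
        // Unmanaged: spread over the 3-5 second range, centered at 4 seconds.
        double[] unmanaged = { 3.0, 3.5, 4.0, 4.5, 5.0 };
        // With affinity: 999 runs at 2 seconds plus one 10 second outlier (0.1%).
        double[] pinned = Enumerable.Repeat(2.0, 999).Concat(new[] { 10.0 }).ToArray();

        Console.WriteLine("Unmanaged: mean {0:F3} s, worst {1:F1} s",
            LatencyStats.Mean(unmanaged), LatencyStats.Worst(unmanaged));
        Console.WriteLine("Pinned:    mean {0:F3} s, worst {1:F1} s",
            LatencyStats.Mean(pinned), LatencyStats.Worst(pinned));
    }
}
```

The pinned mean comes out near 2 seconds, half the unmanaged mean, while the pinned worst case is 10 seconds against 5. Which number matters depends on whether the system has a hard deadline.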
FWIW: In the case of the IBM Xeon system, we actually removed 48 of the processors from the operating system ENTIRELY and implemented "bare-metal" single-threaded code for the real-time requirements (effectively making them dedicated microcontrollers). This achieved a significant (4x) improvement over the best that could be achieved with processes running on top of an OS.
ZOverLord
December 10th, 2008, 06:18 PM
ZOverLord
Your last reply directly contradicts my experiences over the past decade (including using high-end machines such as IBM's 64 (Xeon) processor servers).
Please show us respected white papers that support this theory, links please. There should be many for statements like these:
"For a true real-time system, this means that the "managed" system did NOT double the performance (from 4 seconds, to 2 seconds), it actually slowed it down by a factor of 2 (from a worst case of 5 seconds to a worst case of 10 seconds)."
"While the types of management you are promoting can appear to yield "better" performance on a statistical basis, the real test comes down to what is the actual WORST case performance that can be induced."
These statements make no common sense. Statement 1 is saying that you will have a 200 percent performance impact by attempting to manage a system; show me respected white papers that say this, please. Statement 2 says don't auto-balance, simply define the worst case and use that for everyday core selection. Again, show me respected white papers that say this, please.
Are you reading what you are writing?
Secondly, please explain how you would exclude cores as well in this case, without code.
Also, I never made a claim that adding balancing logic to a system will double the performance of a system. Not sure where you got that from?
Thanks.
TheCPUWizard
December 10th, 2008, 06:30 PM
The quickest reference is actually right in the current issue of MSDN, where the difference between "Fairness" and "Total Performance" is discussed. Most of the detailed material I have is either subject to NDA or otherwise difficult to post, but I will look for some when I am back in the office.
It is actually very simple to demonstrate with even simple applications that keep track of the WORST CASE time to perform a given task. Simply run a number of instances equal to the number of cores. Let them run without doing any "affinity". Then start "locking" various instances to specific cores.
As far as excluding the cores, it is a fairly simple matter to adjust the HAL so that it simply does not report their existence to the operating system. As far as the OS is concerned, a Q6600 can easily become a 1, 2, or 3 core system, or an i7 a 2, 4, or 6 "core" system (due to some of the Windows internals, we have NOT found a viable method of blocking a single hyper-threaded pathway on a specific core, and that scenario is very unlikely to provide benefit in any case).
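A minimal sketch of the worst-case demonstration described above, with one simplification: it runs one thread per core inside a single process instead of separate instances, which is easier to show in one file. WorstCaseProbe is a hypothetical name and the spin-wait task is a made-up placeholder for real work.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class WorstCaseProbe
{
    // Runs task `repeats` times on each of `workers` threads and returns the
    // longest single-run time seen by each worker, in milliseconds.
    public static double[] Run(int workers, int repeats, Action task)
    {
        double[] worst = new double[workers];
        Thread[] threads = new Thread[workers];
        for (int w = 0; w < workers; w++)
        {
            int id = w; // per-thread copy for the closure
            threads[w] = new Thread(() =>
            {
                Stopwatch sw = new Stopwatch();
                for (int r = 0; r < repeats; r++)
                {
                    sw.Restart();
                    task();
                    sw.Stop();
                    worst[id] = Math.Max(worst[id], sw.Elapsed.TotalMilliseconds);
                }
            });
            threads[w].Start();
        }
        foreach (Thread t in threads)
            t.Join();
        return worst;
    }
}

static class Program
{
    static void Main()
    {
        // One worker per logical processor, no affinity set.
        double[] worst = WorstCaseProbe.Run(
            Environment.ProcessorCount, 200, () => Thread.SpinWait(100000));
        for (int i = 0; i < worst.Length; i++)
            Console.WriteLine("Worker {0}: worst run {1:F3} ms", i, worst[i]);
    }
}
```

Run it once as is, then again after setting affinity by whatever means is under test, and compare the worst-run column rather than the averages.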
ZOverLord
December 10th, 2008, 06:34 PM
The quickest reference is actually right in the current issue of MSDN, where the difference between "Fairness" and "Total Performance" is discussed. Most of the detailed material I have is either subject to NDA or otherwise difficult to post, but I will look for some when I am back in the office.
It is actually very simple to demonstrate with even simple applications that keep track of the WORST CASE time to perform a given task. Simply run a number of instances equal to the number of cores. Let them run without doing any "affinity". Then start "locking" various instances to specific cores.
As far as excluding the cores, it is a fairly simple matter to adjust the HAL so that it simply does not report their existence to the operating system. As far as the OS is concerned, a Q6600 can easily become a 1, 2, or 3 core system, or an i7 a 2, 4, or 6 "core" system (due to some of the Windows internals, we have NOT found a viable method of blocking a single hyper-threaded pathway on a specific core, and that scenario is very unlikely to provide benefit in any case).
The article in MSDN does not support your statements that I reference in my previous post and the goal here is not to hide cores from all processes by hiding them from the operating system, it is to limit launches of other processes to them from this process in question.
Hi,
I tried searching but am coming up short. I have a quad-core PC running Win XP and am looking for a solution where I can selectively assign jobs to each processor, i.e. I want to pick one of my processors and tell it to work exclusively on some specific data crunching. I tried to find a library that will allow me to do so, but with no luck.
Does anyone know of a library that will let me assign jobs to a specific processor? Any other approaches and solutions are greatly appreciated.
I would avoid most libraries if possible and roll your own. While this is a thread balancing example, the point still stands:
"Initially, the thread control was implemented using Microsoft's Parallel Extensions Library. This made life a little easier, but unfortunately, performance was terrible."
Just one of many complaints you will find about this library.
Instead, create your own factors, as in this example, because you will minimize the overhead of using a library that may be doing far more than what you need, and the overhead of that library may exceed the benefit it could provide you for the case in question:
Sometimes when people create libraries, they try to be all things to all people, which can defeat the purpose they were created for. Using the techniques above, you should be able to create important factors to use for weighting and you will have a much faster system response time, if done right ("Which is not complicated or hard to do"), which will also create the isolation of cores you want.
Mutant_Fruit
December 10th, 2008, 07:52 PM
I think you're confusing the issue.
A standard desktop only has the windows scheduler. It has no other way to automatically set thread affinity. In this scenario, trying to micromanage threads is 'wrong' for the reasons I have very clearly laid out already. Therefore under these circumstances trying to micromanage which core your process works on is both 'stupid and braindead' as you pointed out.
Adding balancing code to automatically adjust processor affinity to 'choose the best core' on a standard desktop is both 'stupid and braindead' too. Why would you replicate a complex feature which the operating system already provides for free? Especially when that code is already running under the windows scheduler and so its performance is already constrained by the windows scheduler.
Again, the best method would be to have a process do ALL core management, but as a secondary solution adding balancing logic to a program that is responsible to launch many other processes over its runtime life is a good idea, especially when you are trying to exclude some cores so that they won't have the workload of the processes you are launching added to them.
No. How can your process tell which core is the best? Implement complex algorithms which will have a 50/50 chance of failing? How can you tell how busy a core really is when the Windows scheduler can pre-empt any of your code for an unknown (and potentially lengthy) amount of time? You can't do this from usercode. Core exclusion is exactly the same as setting affinity to one (or more) cores. You can still very easily 'exclude' yourself from the only free core.
Say you have 8 cores and wish to dedicate 2 to development and leave the other 6 to your user base. You have some development applications that create many processes over time and you want to add balancing code to them.
That's fine for development. It's your machine, you know what's running on it, you can have a reasonable idea which core is 'best'. But for release you don't have any idea what the end-user is running on their desktop and so you shouldn't do this. Otherwise you run the risk of crippling performance artificially.
So my point is that if you are designing an application for the standard desktop or standard server, you are shooting yourself in the foot by trying to micromanage the cores your threads operate on.
I have designed and coded a commercial product that does this today for many companies worldwide, and has been doing so for over 20 years, on systems that have 2-16 cores per node and can have over 2,000 nodes per network. That's my real-life experience.
This is an exercise in learning how to multi-thread an application's workflow. It deals with splitting a task up into decent sized chunks which can then be executed in parallel. It isn't really relevant to the discussion unless you want to use it as an example of how you can achieve great performance by letting windows automatically manage which core your thread executes on...
A standard desktop only has the windows scheduler. It has no other way to automatically set thread affinity. In this scenario, trying to micromanage threads is 'wrong' for the reasons I have very clearly laid out already. Therefore under these circumstances trying to micromanage which core your process works on is both 'stupid and braindead' as you pointed out.
First, we are NOT talking about threads here, we are talking about launching processes. Secondly, a standard desktop has more than the windows scheduler. Here is just one example of many. But one example trumps "no other way", so my point has been made.
Adding balancing code to automatically adjust processor affinity to 'choose the best core' on a standard desktop is both 'stupid and braindead' too. Why would you replicate a complex feature which the operating system already provides for free? Especially when that code is already running under the windows scheduler and so its performance is already constrained by the windows scheduler.
False. It is easy to do, and it is smart when you wish to exclude cores as well.
If it was 'stupid and braindead' as you say then why is load-balancing part of the next version of Visual Studio 2010? Is Microsoft 'stupid and braindead' as well?
from: http://channel9.msdn.com/pdc2008/TL26/
"The manycore shift presents an unprecedented business opportunity for developers to design new software experiences that take advantage of the performance power of manycore architectures. At the same time, parallel programming is complex, difficult and labor-intensive, for even the most skilled developers."
No. How can your process tell which core is the best? Implement complex algorithms which will have a 50/50 chance of failing? How can you tell how busy a core really is when the windows scheduler can pre-empt any of your code for an unknown (and potentially lengthy) amount of time? You can't do this from usercode. Core exclusion is exactly the same as setting affinity to one (or more) cores. You can still very easily 'exclude' yourself from the only free core.
You can do this from usercode. There is nothing complex about it for most people.
"Performance Counter Walkthroughs
The walkthroughs in this section introduce you to the use of the PerformanceCounter component. The walkthroughs show you how to use both system performance counters and custom performance counters."
That's fine for development. It's your machine, you know what's running on it, you can have a reasonable idea which core is 'best'. But for release you don't have any idea what the end-user is running on their desktop and so you shouldn't do this. Otherwise you run the risk of crippling performance artificially.
Microsoft is NOT adding this to Visual Studio 2010 for development machines, it's doing it to help you load-balance on your users system.
"Come learn how the next version of Visual Studio and the Microsoft .NET Framework can help you write better performing and more scalable applications."
from: http://channel9.msdn.com/pdc2008/TL26/
So my point is that if you are designing an application for the standard desktop or standard server, you are shooting yourself in the foot by trying to micromanage the cores your threads operate on.
My point is that every statement you have made in this post, not just some, has no factual basis at all: none, nada, zip.
Unlike you, I provided links to prove that. Now you can call Microsoft crazy too. My intent is to provide facts to the original poster, not conjecture, or unsupported statements.
I think I have done that, with your help, of course.
This is not a standard server or desktop.
There is no standard server or desktop. What would they be? What do they look like?
Please list the software restrictions and limitations on "Standard" servers and desktops.
ZOverLord
December 10th, 2008, 08:45 PM
This is an exercise in learning how to multi-thread an application's workflow. It deals with splitting a task up into decent sized chunks which can then be executed in parallel. It isn't really relevant to the discussion unless you want to use it as an example of how you can achieve great performance by letting windows automatically manage which core your thread executes on...
I'm not sure what relevance this has at all.
I provided the links and I know what the page contents are about. The point of the multi-thread link was to show how the libraries failed to provide a performance improvement. That means that in some cases it may be better to create your own methods for core selection than to use libraries.
The second link shows how to create factors. I am sure the person that started this thread can see that.
Mutant_Fruit
December 11th, 2008, 04:18 AM
First, we are NOT talking about threads here, we are talking about launching processes. Secondly, A standard desktop has more than the windows scheduler. Here is just one example of many. But one example trumps "no other way ", so my point has been made.
Unfortunately, that's not a scheduler. So yes, the windows scheduler is still the only thing that can bounce threads around onto the 'best' core. I did make a confusing statement in my last post; what I was trying to say is that there's no other way to automatically bounce threads between cores so that all threads are spread evenly. Of course, there's a way to set affinity. C# exposes this, the task manager exposes this (if you care to check it out) and any program can expose this facility.
However, you're still left with the extremely difficult task of finding the best core for each and every thread.
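For reference, the affinity setting discussed above is just a bitmask: bit n set means logical processor n is allowed. In .NET that mask is what `Process.ProcessorAffinity` takes. The sketch below only does the mask arithmetic, in Python for brevity; it calls no Windows API, and the helper names (`affinity_mask`, `mask_excluding`, `cores_in_mask`) are made up for illustration.

```python
def affinity_mask(cores):
    """Build a bitmask with one bit set per allowed logical core."""
    mask = 0
    for core in cores:
        if core < 0:
            raise ValueError("core indices must be non-negative")
        mask |= 1 << core
    return mask


def mask_excluding(total_cores, excluded):
    """Mask that allows every core below total_cores except the excluded ones."""
    excluded = set(excluded)
    return affinity_mask(c for c in range(total_cores) if c not in excluded)


def cores_in_mask(mask):
    """Inverse helper: list the core indices whose bits are set."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]
```

On an 8-core box, `mask_excluding(8, [0, 1])` keeps cores 2 to 7 and leaves cores 0 and 1 out. Building the mask is the easy part; deciding which cores belong in it is the hard part argued about in this thread.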
If it was 'stupid and braindead' as you say then why is load-balancing part of the next version of Visual Studio 2010? Is Microsoft 'stupid and braindead' as well?
The walkthroughs in this section introduce you to the use of the PerformanceCounter component. The walkthroughs show you how to use both system performance counters and custom performance counters.
Performance counters can give you this information on cpu usage etc, but if you want to even attempt to utilise this you are essentially implementing a userland scheduler, which runs under the windows scheduler, and as I said before - will be constrained by the windows scheduler.
Microsoft is NOT adding this to Visual Studio 2010 for development machines, it's doing it to help you load-balance on your users system.
Load balancing is good - setting tasks to a specific core is bad unless you know that no other application will be choosing that specific core.
There is no standard server or desktop. What would they be? What do they look like?
Not an embedded system. Not one which has 1000 cores in a single box. One which does not use a third party scheduler which overrides cpu affinity set by a program.
I provided the links, I know what the page contents are about, the point in the multi-thread link was to show how the libraries failed to provide a performance improvement.
I have seen benchmarks where the library was used incorrectly/badly.
The parallel library can improve performance significantly when it's used for a suitable task. Once again, this has nothing to do with setting specific tasks to specific cores.
1) So, setting a specific task to a specific core is still bad.
2) Yes, you can get performance counter information from windows - no, I really don't believe you can implement a userland scheduler which will outperform the built in scheduler. However, I haven't ever seen a userland scheduler nor am I going to implement such a thing to prove my point.
3) Splitting up a large task and letting the OS schedule that work onto the best core is good. It's what people have been doing for years and will continue to do for decades to come.
4) It is still hard to tell from usercode which core is good because you never know what tasks will be starting/stopping at any given instant. A userland scheduler will just be a poor and slow imitation of the windows scheduler.
And that's me for this thread. I can't make my point any clearer and I have yet to see a good rebuttal for my basic point:
If you force an intensive task to core0 and there's already an intensive task on core0 - you've just cut your performance.
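A toy model of that last point, sketched in Python. It assumes ideal fair time-slicing between CPU-bound tasks on the same core and ignores priorities, caches and everything else a real scheduler deals with; the function names are hypothetical and the numbers are relative run times, not measurements.

```python
def slowdown_if_placed(busy_tasks_per_core, core):
    """Relative run time of a new CPU-bound task placed on `core`.

    With fair time-slicing, a core already running k CPU-bound tasks gives
    the newcomer 1/(k+1) of its time, so the task takes (k+1) times as long.
    """
    return busy_tasks_per_core[core] + 1


def least_loaded_core(busy_tasks_per_core):
    """Index of the core with the fewest busy tasks (first one on ties)."""
    return min(range(len(busy_tasks_per_core)),
               key=lambda c: busy_tasks_per_core[c])
```

With `busy = [1, 0, 0, 0]`, pinning to core 0 gives `slowdown_if_placed(busy, 0) == 2`, while the least loaded core (core 1) gives a factor of 1. The catch is that `busy` is only a snapshot and can be out of date by the time the task actually starts.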
ZOverLord
December 11th, 2008, 04:59 AM
Unfortunately, that's not a scheduler. So yes, the windows scheduler is still the only thing that can bounce threads around onto the 'best' core. I did make a confusing statement in my last post; what I was trying to say is that there's no other way to automatically bounce threads between cores so that all threads are spread evenly. Of course, there's a way to set affinity. C# exposes this, the task manager exposes this (if you care to check it out) and any program can expose this facility.
However, you're still left with the extremely difficult task of finding the best core for each and every thread.
Load balancing != choosing a specific core.
Parallel programming != choosing specific cores.
Performance counters can give you this information on cpu usage etc, but if you want to even attempt to utilise this you are essentially implementing a userland scheduler, which runs under the windows scheduler, and as I said before - will be constrained by the windows scheduler.
Load balancing is good - setting tasks to a specific core is bad unless you know that no other application will be choosing that specific core.
Not an embedded system. Not one which has 1000 cores in a single box. One which does not use a third party scheduler which overrides cpu affinity set by a program.
I have seen benchmarks where the library was used incorrectly/badly.
The parallel library can improve performance significantly when it's used for a suitable task. Once again, this has nothing to do with setting specific tasks to specific cores.
1) So, setting a specific task to a specific core is still bad.
2) Yes, you can get performance counter information from windows - no, I really don't believe you can implement a userland scheduler which will outperform the built in scheduler. However, I haven't ever seen a userland scheduler nor am I going to implement such a thing to prove my point.
3) Splitting up a large task and letting the OS schedule that work onto the best core is good. It's what people have been doing for years and will continue to do for decades to come.
4) It is still hard to tell from usercode which core is good because you never know what tasks will be starting/stopping at any given instant. A userland scheduler will just be a poor and slow imitation of the windows scheduler.
And that's me for this thread. I can't make my point any clearer and I have yet to see a good rebuttal for my basic point:
If you force an intensive task to core0 and there's already an intensive task on core0 - you've just cut your performance.
The links I provided are based on facts; the original poster can determine the difference between your unsubstantiated statements and facts.
Again, you stated: "A standard desktop only has the windows scheduler. It has no other way to automatically set thread affinity." One would have to agree that forcing a process onto a specific core is another way to "automatically set thread affinity", since the process can be placed on a different core than the one the windows scheduler might have chosen. But you can't see that.
My links stand as fact, supported by Microsoft as well.
None, as in "zero", of your points are supported by facts, apart from the statement that the libraries can provide performance improvements (which I never disputed; I said it is not true in all cases, which does not mean never, lol).
I have asked for links to white papers or links to articles that specifically support your allegations, and so far, you have produced none. Which does not surprise me at all.
Suddenly, you are quoting links about how good the libraries are, yet you call these concepts "stupid" in your other posts, and you refuse to "Own Up" to the fact that Visual Studio 2010 will allow you to do these "Same" stupid things, lol.
You also still fail to comprehend that the original poster is talking about process load balancing, not thread load balancing, and that they would also like to exclude cores from the mix of cores that will get processes spawned by their application, while leaving those cores available for other processes.
Your solution is to "Fool" the "Entire" operating system into thinking that some cores are not present. This also does not surprise me.
Yes, using your methods could, in fact, cause a slowdown of 200 percent or more, because that is what happens when you make entire cores go away for any and all processing on a system.
So, I will leave you with your opinions, as well as many links here, that support the facts.
I will answer any questions I can, if the creator of this thread would like my input, but it is "Fruitless" to banter with your unsubstantiated opinions and beliefs.
toraj58
December 11th, 2008, 04:02 PM
The amount of performance gained by the use of a multicore processor depends on the problem being solved and the algorithms used, as well as their implementation in software (Amdahl's law). For so-called "embarrassingly parallel" problems, a dual-core processor with two cores at 2GHz may perform very nearly as fast as a single core of 4GHz. Other problems though may not yield so much speedup. This all assumes however that the software has been designed to take advantage of available parallelism. If it hasn't, there will not be any speedup at all. However, the processor will multitask better since it can run two programs at once, one on each core.
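Amdahl's law, mentioned above, can be written as speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of cores. A minimal sketch in Python follows; the function name and argument names are chosen here for illustration and do not come from any library.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup from Amdahl's law.

    parallel_fraction: share of the work that can run in parallel (0.0 to 1.0).
    cores: number of cores working on the parallel part (at least 1).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)
```

A fully parallel job (`parallel_fraction=1.0`) on two cores gives a bound of 2.0, which is the "embarrassingly parallel" case above. A job that is only half parallel gives about 1.33 on two cores and can never reach 2.0 however many cores are added.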
In addition to operating system (OS) support, adjustments to existing software are required to maximize utilization of the computing resources provided by multi-core processors. Also, the ability of multi-core processors to increase application performance depends on the use of multiple threads within applications. The situation is improving: for example the American PC game developer Valve Corporation has stated that it will use multi core optimizations for the next version of its Source engine, shipped with Half-Life 2: Episode Two, the next installment of its Half-Life series, and Crytek is developing similar technologies for CryEngine 2, which powers their game, Crysis. Emergent Game Technologies' Gamebryo engine includes their Floodgate technology which simplifies multicore development across game platforms. See Dynamic Acceleration Technology for the Santa Rosa platform for an example of a technique to improve single-thread performance on dual-core processors.
Integrating multiple cores on one chip drives production yields down, and such chips are more difficult to manage thermally than lower-density single-chip designs. Intel has partially countered the first problem by building its quad-core designs from two dual-core dies in a single package, so any two working dual-core dies can be used, as opposed to producing four cores on a single die and requiring all four to work to produce a quad-core. From an architectural point of view, ultimately, single CPU designs may make better use of the silicon surface area than multiprocessing cores, so a development commitment to this architecture may carry the risk of obsolescence. Finally, raw processing power is not the only constraint on system performance. Two processing cores sharing the same system bus and memory bandwidth limit the real-world performance advantage. If a single core is close to being memory bandwidth limited, going to dual-core might only give 30% to 70% improvement. If memory bandwidth is not a problem, a 90% improvement can be expected. It would be possible for an application that used two CPUs to end up running faster on one dual-core if communication between the CPUs was the limiting factor, which would count as more than 100% improvement.
Managing concurrency acquires a central role in developing parallel applications. The basic steps in designing parallel applications are:
Partitioning
The partitioning stage of a design is intended to expose opportunities for parallel execution. Hence, the focus is on defining a large number of small tasks in order to yield what is termed a fine-grained decomposition of a problem.
Communication
The tasks generated by a partition are intended to execute concurrently but cannot, in general, execute independently. The computation to be performed in one task will typically require data associated with another task. Data must then be transferred between tasks so as to allow computation to proceed. This information flow is specified in the communication phase of a design.
Agglomeration
In the third stage, we move from the abstract toward the concrete. We revisit decisions made in the partitioning and communication phases with a view to obtaining an algorithm that will execute efficiently on some class of parallel computer. In particular, we consider whether it is useful to combine, or agglomerate, tasks identified by the partitioning phase, so as to provide a smaller number of tasks, each of greater size. We also determine whether it is worthwhile to replicate data and/or computation.
Mapping
In the fourth and final stage of the parallel algorithm design process, we specify where each task is to execute. This mapping problem does not arise on uniprocessors or on shared-memory computers that provide automatic task scheduling.
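The four stages above can be sketched on a trivial problem, summing squares, in Python. This is an illustration under stated assumptions, not a definitive implementation: the helper names are hypothetical, and a thread pool is used only to keep the sketch portable (under CPython, a `ProcessPoolExecutor` would be the usual swap-in for CPU-bound work). Note that the mapping stage picks no core at all; it is left to the pool and the OS scheduler.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(n, pieces):
    """Partitioning: split range(n) into many small contiguous index ranges."""
    step = max(1, n // pieces)
    return [(lo, min(lo + step, n)) for lo in range(0, n, step)]


def agglomerate(ranges, group_size):
    """Agglomeration: merge adjacent small ranges into fewer, larger tasks."""
    return [(ranges[i][0], ranges[min(i + group_size, len(ranges)) - 1][1])
            for i in range(0, len(ranges), group_size)]


def sum_of_squares(bounds):
    """The work done by one task on its own slice of the index range."""
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))


def parallel_sum_of_squares(n, pieces=64, group_size=8, workers=4):
    tasks = agglomerate(partition(n, pieces), group_size)
    # Mapping: no core is chosen here; the pool and the OS scheduler decide.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum_of_squares, tasks))
    # Communication: the only data exchanged is each task's partial result.
    return sum(partials)
```

Calling `parallel_sum_of_squares(1000)` returns the same value as `sum(i * i for i in range(1000))`; only the way the work is cut up and handed out differs.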
On the other hand, on the server side, multicore processors are ideal because they allow many users to connect to a site simultaneously and have independent threads of execution. This allows for Web servers and application servers that have much better throughput.
codeguru.com