29th December 2007 at 6:03 am #14225
Dear Muster team,
You have put an incredibly needed feature into Muster 5.31, but I honestly think you’ve implemented it in a terrible way.
I’ve been laboring to come up with a way to monitor CPU usage across render nodes, but in 5.31, you include CPU usage as part of the node status, which is an absolutely excellent development. You really, really need this.
But then, in order to save a couple of bytes of bandwidth out of the terabytes of data that pass through the network, you only update CPU usage when the node’s status changes, which to me is bizarre. You literally have to go to each render client and click Refresh to see updated CPU usage. What??? I mean, what??? This is the same as putting the feature in and then taking it back out!
Even with 10,000 render nodes, updating the CPU status is such a minor thing, a couple of bytes maybe every 10 seconds. Why on Earth have you put in this feature and then decided to cripple it? It doesn’t make sense at all. Updating CPU status is not a drain on bandwidth, especially if you stagger the requests. Or to put it another way: if checking the CPU usage is a drain, then there’s something very, very wrong with the code you’re using to request it.
Apparently this is not a bug, as you’ve described it in the manual. But there is no technical basis for this limitation, none whatsoever. If you simply poll one render client every second, we’re talking about maybe 6 bytes of request data plus around 40 bytes of TCP overhead. Any gigabit network has capacity for maybe 80,000,000–100,000,000 bytes per second. What’s 46 bytes more?
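Just to make the arithmetic explicit (the byte counts here are my own rough estimates from above, not measured figures):

```python
# Back-of-the-envelope check of the bandwidth argument above.
# Assumed numbers: a ~6-byte request plus ~40 bytes of TCP overhead,
# sent once per second, on a gigabit link.

POLL_BYTES_PER_SECOND = 6 + 40           # one CPU-usage poll per second
GIGABIT_BYTES_PER_SECOND = 125_000_000   # 1 Gbit/s = 125 MB/s, theoretical

fraction = POLL_BYTES_PER_SECOND / GIGABIT_BYTES_PER_SECOND
print(f"One poll per second uses {fraction:.8%} of a gigabit link")
```

That works out to well under a millionth of the link.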
Simply scan a new render client every second. People who have only 5 render clients get updated fast. People with 5,000 render clients simply get updated slower.
But don’t cripple the feature. The CPU Usage feature is, currently, completely useless. But you really need it.
Per

29th December 2007 at 6:45 am #14765
I should only add that the way CPU Usage is currently implemented is especially useless, because it only updates at exactly the times when CPU usage is unreliable, such as when a new chunk is queued. At the exact time you’re polling the CPU usage, it tends to be around 20%, because you’re *between chunks*. I don’t think I could come up with a worse way to do it if I tried.
Sorry for the rant. It would be really nice to see this feature live up to its potential.
Per

1st January 2008 at 1:34 am #14767
There’s a specific reason for this feature design.
Even if you think it transfers only a couple of bytes, remember that a TCP packet is always 512 bytes in size.
Apart from that, Muster uses a FIFO network protocol over TCP, which means that every packet is put into a queue waiting for network delivery. Scheduling an automatic message every X seconds may lead to an internal buffer overflow under heavy network usage, even with an intelligent substitutable message queue.
Now, thinking of a 200–300 host farm, having those messages in the queue isn’t a real problem, but the delay they may cause to more important messages (render completions) may lead to the loss of precious seconds.
Don’t think in a “gigabit” way. Under load, there’s a good chance that your network is saturated for about 80% of rendering time due to file streaming. The only way to make what you’re describing really work is to define a local QoS (Quality of Service) policy directly on your switches (which must be layer 3). Unfortunately, that is something done by only a very small fraction of end users.
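For readers wondering what a “substitutable message queue” might mean: in simplified form, it is a queue where a newer status message from a node replaces any message from the same node still waiting to be sent, so the backlog is bounded by the number of nodes rather than the polling rate. A minimal sketch, assuming nothing about Muster’s actual internals:

```python
# Simplified sketch of a "substitutable" (coalescing) message queue.
# A fresh CPU-usage message for a node overwrites any stale message
# from that node still waiting in the queue; send order otherwise
# stays FIFO by first arrival. Illustrative only, not Muster's code.
from collections import OrderedDict

class CoalescingQueue:
    def __init__(self):
        self._pending = OrderedDict()  # node -> latest queued message

    def put(self, node, message):
        # Overwriting an existing key keeps the node's original queue
        # position, so one slow node cannot pile up messages.
        self._pending[node] = message

    def get(self):
        # Pop the oldest entry first (FIFO across nodes).
        return self._pending.popitem(last=False)

    def __len__(self):
        return len(self._pending)
```

Even with such a queue, the developer’s point stands: the queued status messages still compete with render-completion traffic for the same TCP connection.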
Just to close the case, we’ll make that feature user-configurable in the near future.
Cheers

11th January 2008 at 12:26 am #14777
Another option is to do as suggested above: simply set a maximum number of polls per second. If it’s set to 1, only one node will be polled for CPU usage every second, in a circular rotation, so if you have 100 render nodes, it will take 100 seconds for all CPU usages to be updated.
If you set it to 10 per second, then with 100 nodes all CPU usages will be current after 10 seconds.
I think the problem you were having design-wise might have been that if you wanted to poll hundreds of render clients every second, then you’re right, that’s no longer an insignificant amount of data. But 2 or 3 nodes per second should be totally doable.
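The circular rotation described above can be sketched in a few lines. The node names and the poller function here are made up for illustration, not Muster’s actual API:

```python
# Rate-limited round-robin polling: each "tick" (one second) polls at
# most polls_per_second nodes, cycling through the whole farm.
from collections import deque

def make_poller(nodes, polls_per_second):
    """Return a function that yields the next batch of nodes to poll."""
    ring = deque(nodes)

    def next_batch():
        batch = []
        for _ in range(min(polls_per_second, len(ring))):
            node = ring.popleft()
            batch.append(node)
            ring.append(node)  # back of the ring for the next full cycle
        return batch

    return next_batch

# With 100 nodes at 1 poll/second, a full refresh takes 100 ticks;
# at 10 polls/second it takes 10 ticks.
nodes = [f"node{i:03d}" for i in range(100)]
tick = make_poller(nodes, polls_per_second=10)
```

Small farms refresh fast, huge farms refresh slower, and the per-second traffic stays constant regardless of farm size, which is exactly the trade-off proposed above.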
Thanks for putting it in!