Virtual Vertex | dropped nodes and chunks looping

This topic has 6 replies, 4 voices, and was last updated 18 years, 6 months ago by Leonardo Bernardini.

Viewing 7 posts - 1 through 7 (of 7 total)

Posted in: Muster usage
ironhalo

7th August 2007 at 5:32 pm #14202

hello,

we have a problem where chunks will be sent to a render node, render successfully and, upon completion, be requeued. in the chunk detail window, the chunk goes from yellow (submitted) to red, even while it’s still processing in the farm.

i’ve also noticed that i’ll lose nodes in the explorer window, but they come back. i’m not sure if this is related or why this occurs. some scenes render just fine, while others experience this error. we’re running muster 5.2.

Leonardo Bernardini

10th August 2007 at 1:16 am #14722

This may be due to the keep alive feature. Check the dispatcher.log and renderclient.log files and increase the heartbeat/keep alive setting accordingly.

This functionality has been revised; Muster 5.3, which has just been posted, addresses some issues with the keep alive.

Regards.

mwilliams

10th August 2007 at 10:58 pm #14723

we’ve been having trouble with dropped nodes, as well. I’m using Muster 5.3 – are you saying that I should increase the length of the heartbeat and the keep alive, or decrease it? I have about 230 nodes and having a lot of extra traffic between explorer and the render nodes seems to cause the dispatcher’s cpu usage to max out, making Muster unresponsive. Sometimes the nodes drop out before the render clients have completely loaded the scene – they come back, but their chunks will have to be requeued. Is there something I can do to my dispatcher network settings to stabilize this?

Leonardo Bernardini

11th August 2007 at 11:56 pm #14724

The keepalive/heartbeat feature is required by Muster to be warned if a node becomes unresponsive. It’s a simple pulse that’s sent from the client to the server every X seconds, then the server sends an ACK back.

If networks are overloaded, it may be possible that this “pulse” takes a long time to come back. In that case, there are two solutions:

– Raise the settings both on the Dispatcher and the renderclients (the Dispatcher value should be double the client value)

– Disable the heartbeat function both on the dispatcher and on the clients.
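
To illustrate the "Dispatcher value is double the client value" rule, here is a minimal Python sketch of dispatcher-side timeout bookkeeping. It is not Muster's implementation: the class name, method names and the injectable clock are all hypothetical, and the only thing taken from the post is the 2x ratio between the two settings.

```python
import time


class HeartbeatMonitor:
    """Hypothetical dispatcher-side record of the last pulse per node."""

    def __init__(self, client_interval_s, clock=time.monotonic):
        self.client_interval_s = client_interval_s
        # Rule of thumb from the post: dispatcher value = 2 x client value,
        # so a single late pulse does not drop the node.
        self.timeout_s = 2 * client_interval_s
        self._clock = clock
        self._last_pulse = {}

    def pulse(self, node):
        """Record that a heartbeat from `node` has just arrived."""
        self._last_pulse[node] = self._clock()

    def unresponsive(self):
        """Return the nodes whose last pulse is older than the timeout."""
        now = self._clock()
        return sorted(
            node
            for node, seen in self._last_pulse.items()
            if now - seen > self.timeout_s
        )
```

With a 30 second client interval, a node is only reported once more than 60 seconds have passed without a pulse, which is why raising both values together gives a congested network more slack.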

Leonardo Bernardini

13th August 2007 at 3:50 pm #14726

An additional note about this topic, just to clarify how Muster works, and some hints about high network usage scenarios:

As most of you know, Muster works with TCP connections. TCP connections are streams of bytes sent across the network that guarantee message delivery and byte ordering.

The amount of data that Muster streams across the network has nothing to do with your renders; it’s related to the controlling/synching messages used to keep track of render status, update the queue status and so on.

Unfortunately for us, the TCP protocol doesn’t provide a reliable way to detect broken connections. Test it: if you cut your network cable O_O, your connections will stay alive unless the OS drops all the active connections (as happens on Windows).

That’s why we implemented the heartbeat feature in Muster. It’s just a pulse message that’s sent by the Renderclients to notify the Dispatcher that they are still alive.
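
As a rough illustration of the pulse/ACK idea, the following Python sketch runs both ends over a local socket pair. The `PULSE` and `ACK` byte strings and all function names are made up for the example; Muster's real wire format is not described in this thread.

```python
import socket
import threading

PULSE = b"PULSE\n"  # hypothetical wire format, not Muster's real protocol
ACK = b"ACK\n"


def send_pulse(sock, timeout_s):
    """Client side: send one pulse, return True only if an ACK arrives in time."""
    sock.settimeout(timeout_s)
    try:
        sock.sendall(PULSE)
        return sock.recv(len(ACK)) == ACK
    except OSError:  # includes the timeout raised when no ACK arrives
        return False


def answer_pulse(sock):
    """Dispatcher side: read one pulse and acknowledge it."""
    if sock.recv(len(PULSE)) == PULSE:
        sock.sendall(ACK)


def demo():
    """Return (answered, silent): one pulse that is ACKed, one that is not."""
    client, dispatcher = socket.socketpair()
    try:
        responder = threading.Thread(target=answer_pulse, args=(dispatcher,))
        responder.start()
        answered = send_pulse(client, 2.0)
        responder.join()
        # Nobody answers the second pulse. The connection itself still looks
        # open, so only the missing ACK reveals that the peer is silent.
        silent = send_pulse(client, 0.2)
        return answered, silent
    finally:
        client.close()
        dispatcher.close()
```

The second pulse in `demo()` is the interesting case: the send succeeds without any error, and it is only the application-level timeout that notices something is wrong.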

Unfortunately, even this method can cause problems. During heavy network usage, your entire network is flooded by packets belonging to your scene files and under the direct control of your batch renders. This means there’s no way to tune the network usage, and in such a situation a control/update message generated by Muster may take ages to be delivered. This includes the hb “pulse” messages as well as updates to the explorer’s queue and controlling messages (pause/reinit/kill).

Raising the heartbeat setting is a solution, but it does not improve your system usage. Let’s look at a typical scenario:

You have 100 hosts currently rendering and, due to the network usage, a controlling message may take up to 5 minutes to reach the Dispatcher. When a render is completed, a host sends a “render done” message to the Dispatcher and waits for further commands. This 5-minute delay means that you’ll end up with a render host idle for at least 5 minutes waiting for further commands. Result: 5 minutes of wasted rendering time.
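
To put a number on that scenario across the whole farm, here is a back-of-the-envelope Python sketch. It assumes, purely for illustration, that every host hits the full delay once per chunk; the function name and the `chunks_per_host` parameter are not from the post.

```python
def wasted_host_minutes(hosts, delay_min, chunks_per_host=1):
    """Idle host-minutes if each host waits `delay_min` after every chunk.

    Assumption for illustration only: every host sees the full delay
    once per completed chunk.
    """
    return hosts * delay_min * chunks_per_host


# The scenario from the post: 100 hosts, 5-minute delay, one chunk each.
per_round = wasted_host_minutes(100, 5)

# Same farm, but each host works through 10 chunks of the job.
per_job = wasted_host_minutes(100, 5, chunks_per_host=10)
```

Under these assumptions one round of chunks costs 500 host-minutes of idle time, and the loss grows linearly with the number of chunks each host processes.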

Improving the network bandwidth doesn’t solve the problem: if your jobs are big enough, they may be able to saturate even the fastest fiber connection.

There are a few hints that users running big environments may follow:

1) Tune your network using QoS (Quality of Service) tools. These tools let you specify which kind of traffic should have priority. Giving full priority to the Muster TCP packets running on the Muster ports (7680, 7681, 7683, 7690) will make sure controlling messages are delivered almost in real time. Do not worry about network performance; the Muster network usage is really low. In a 100-host environment, it may reach a maximum of an average 1Kb/s on standard idle usage.

2) If you own layer 4 switches, the priority may be set at switch level; this will give you the same results as point 1.

3) Use a dual network setup. By using 2 network cards on each host, you can route the Muster controlling messages on one network and the file streaming on the second. This may be done easily by assigning different network addresses, like:

controlling network: 192.168.1.x
file-stream network: 192.168.2.x

If you tell Muster to connect to the Dispatcher at, e.g., 192.168.1.1 and to access the files on \\192.168.2.1\myreposity, you’re forcing a network routing that splits the messages.

Alternatively, you can set up the routing tables on your hosts, though this may be a little more complicated.
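
A quick way to sanity-check such a dual network plan is to confirm which network each address falls into. The Python sketch below uses the standard `ipaddress` module; the /24 mask is an assumption (the post only gives 192.168.1.x and 192.168.2.x), and the constant and function names are hypothetical.

```python
import ipaddress

# Assumed /24 masks; the post only gives the 192.168.1.x / 192.168.2.x ranges.
CONTROL_NET = ipaddress.ip_network("192.168.1.0/24")
FILE_NET = ipaddress.ip_network("192.168.2.0/24")


def network_for(address):
    """Say which of the two example networks an address belongs to."""
    ip = ipaddress.ip_address(address)
    if ip in CONTROL_NET:
        return "control"
    if ip in FILE_NET:
        return "file-stream"
    return "other"


# Dispatcher address and file server address from the example in the post.
dispatcher_side = network_for("192.168.1.1")
file_server_side = network_for("192.168.2.1")
```

If the Dispatcher address and the file server address resolve to the same network, the controlling messages and the file traffic are still sharing one link and the split has no effect.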

Hope this helps the ones experiencing issues.

Christopher Gaal

30th September 2007 at 7:46 pm #14735

We’re having similar problems with chunks timing out.

I tried turning off the heartbeat function on the dispatcher and render nodes altogether, but some are still timing out. How is the dispatcher deciding what is timed out and what isn’t?

The renders are actually processing correctly. The problem is that it will start the chunk over and we lose a lot of time with the re-renders. I couldn’t find much detail in the dispatcher log as to what it’s doing.

Thanks

Leonardo Bernardini

6th October 2007 at 9:53 am #14738

Timeouts are handled by the heartbeat. Double check your config; you shouldn’t get any kind of timeout if hb is disabled…
