Error: Ran out of slaves to run tasks
Dear Modellers,
While refining loops for a model in parallel, I asked Modeller to generate 9999 models, but it stopped at model 1245 with the following error. The computer has six cores and Modeller was using four of them; the other two cores were free.
Thank you for your help.
<Slave on localhost> failed (Connection lost to slave <Slave on localhost>: [Errno 104] Connection reset by peer) - removing from <Parallel job [<Slave on localhost>, <Slave on localhost>, <Slave on localhost>, <Slave on localhost>]>
<Loop model building task #1.1219> on <Slave on localhost> completed
<Loop model building task #1.1217> on <Slave on localhost> completed
<Loop model building task #1.1222> on <Slave on localhost> completed
<Loop model building task #1.1218> on <Slave on localhost> completed
<Loop model building task #1.1223> on <Slave on localhost> completed
<Loop model building task #1.1224> on <Slave on localhost> completed
<Loop model building task #1.1225> on <Slave on localhost> completed
<Loop model building task #1.1226> on <Slave on localhost> completed
<Loop model building task #1.1228> on <Slave on localhost> completed
<Loop model building task #1.1229> on <Slave on localhost> completed
<Slave on localhost> failed (Connection lost to slave <Slave on localhost>: [Errno 104] Connection reset by peer) - removing from <Parallel job [<Slave on localhost>, <Slave on localhost>, <Slave on localhost>, <Slave on localhost>]>
<Loop model building task #1.1227> on <Slave on localhost> completed
<Loop model building task #1.1230> on <Slave on localhost> completed
<Loop model building task #1.1232> on <Slave on localhost> completed
<Loop model building task #1.1233> on <Slave on localhost> completed
<Loop model building task #1.1234> on <Slave on localhost> completed
<Slave on localhost> failed (Connection lost to slave <Slave on localhost>: [Errno 104] Connection reset by peer) - removing from <Parallel job [<Slave on localhost>, <Slave on localhost>, <Slave on localhost>, <Slave on localhost>]>
<Loop model building task #1.1235> on <Slave on localhost> completed
<Loop model building task #1.1237> on <Slave on localhost> completed
<Loop model building task #1.1238> on <Slave on localhost> completed
<Loop model building task #1.1239> on <Slave on localhost> completed
<Loop model building task #1.1240> on <Slave on localhost> completed
<Loop model building task #1.1241> on <Slave on localhost> completed
<Loop model building task #1.1242> on <Slave on localhost> completed
<Loop model building task #1.1243> on <Slave on localhost> completed
<Loop model building task #1.1244> on <Slave on localhost> completed
<Loop model building task #1.1245> on <Slave on localhost> completed
<Slave on localhost> failed (Connection lost to slave <Slave on localhost>: [Errno 104] Connection reset by peer) - removing from <Parallel job [<Slave on localhost>, <Slave on localhost>, <Slave on localhost>, <Slave on localhost>]>
Traceback (most recent call last):
  File "refine_loop.py", line 41, in <module>
    a.make()
  File "/usr/local/lib/python2.7/site-packages/modeller/automodel/loopmodel.py", line 36, in make
    self.build_seq(self.inimodel, 1)
  File "/usr/local/lib/python2.7/site-packages/modeller/automodel/loopmodel.py", line 190, in build_seq
    self.parallel_loop_models(atmsel, ini_model, num, sched)
  File "/usr/local/lib/python2.7/site-packages/modeller/automodel/loopmodel.py", line 208, in parallel_loop_models
    self.loop.outputs.extend(job.run_all_tasks())
  File "/usr/local/lib/python2.7/site-packages/modeller/parallel/job.py", line 136, in run_all_tasks
    raise ValueError("Ran out of slaves to run tasks")
ValueError: Ran out of slaves to run tasks
XP
On 02/21/2012 08:55 AM, Xiao-Ping Zhang wrote:
> When I was refining loops for a model in parallel, I asked modeller to
> generate 9999 models. But modeller stopped at 1245 with the following
> error. The computer has six cores, modeller uses four of them. The other
> two cores were free.
...
> ValueError: Ran out of slaves to run tasks
This means exactly what it says: all of the slaves died, so it had nowhere to run the loop model building tasks. Each slave generates its own output file (look for files ending in .slave). Look in there to see what the problem was with each slave.
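As a rough illustration of that advice, the sketch below lists every file ending in .slave in a directory and prints the lines that look like failures. It uses only the Python standard library, not Modeller itself. The keyword list and the function names (suspicious_lines, scan_slave_logs) are assumptions made for this example; what a failing slave actually writes may differ, so adjust the keywords to what you see in your own files.

```python
import glob
import os

# Assumption: these keywords are a guess at what a failure looks like in a
# slave output file. They are not taken from Modeller's documentation.
SUSPECT_WORDS = ("error", "traceback", "memory", "killed")


def suspicious_lines(path, words=SUSPECT_WORDS):
    """Return (line_number, text) pairs from `path` that mention any keyword."""
    hits = []
    with open(path, errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            lowered = line.lower()
            if any(word in lowered for word in words):
                hits.append((number, line.rstrip("\n")))
    return hits


def scan_slave_logs(directory="."):
    """Map each *.slave file in `directory` to its suspicious lines."""
    report = {}
    for path in sorted(glob.glob(os.path.join(directory, "*.slave"))):
        report[path] = suspicious_lines(path)
    return report


if __name__ == "__main__":
    for path, hits in scan_slave_logs().items():
        print(path)
        for number, text in hits:
            print("  line %d: %s" % (number, text))
```

Run it from the directory where the modelling script was started. A file with no hits is still worth reading by hand, since a slave that was killed abruptly may not have written any message at all.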
Ben Webb, Modeller Caretaker
On 2/21/12 9:39 AM, Modeller Caretaker wrote:
> On 02/21/2012 08:55 AM, Xiao-Ping Zhang wrote:
>> When I was refining loops for a model in parallel, I asked modeller to
>> generate 9999 models. But modeller stopped at 1245 with the following
>> error. The computer has six cores, modeller uses four of them. The other
>> two cores were free.
> ...
>> ValueError: Ran out of slaves to run tasks
>
> This means exactly what it says: all of the slaves died, so it had
> nowhere to run the loop model building tasks. Each slave generates its
> own output file (look for files ending in .slave). Look in there to see
> what the problem was with each slave.
To conclude: it turned out that each slave was running out of memory. This is actually caused by a memory leak in Modeller that only affects parallel loopmodel runs. A patch is available to fix the problem at http://salilab.org/modeller/wiki/Patches
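For anyone who wants to confirm this kind of leak on their own machine, one possible approach on Linux is to sample the resident memory of each slave process while the job runs. The sketch below is not part of Modeller. The helper names (parse_vmrss_kb, rss_kb) are invented for this example, and it assumes a Linux /proc filesystem and that you already know the slave process IDs (for instance from ps).

```python
import re


def parse_vmrss_kb(status_text):
    """Extract resident memory in kB from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def rss_kb(pid):
    """Read the resident memory of a running process (Linux only)."""
    with open("/proc/%d/status" % pid) as handle:
        return parse_vmrss_kb(handle.read())
```

Calling rss_kb for each slave PID every few minutes and writing the values down is enough to tell the two cases apart: a figure that keeps climbing as more models are built points to a leak, while a flat figure does not.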
Ben Webb, Modeller Caretaker
Dear Ben,
The patch fixed the problem. Memory usage is now stable at ~14.3 MB per core. Great!
Thank you very much!
Xiao-Ping
On Thu, 2012-02-23 at 10:54 -0800, Modeller Caretaker wrote:
> On 2/21/12 9:39 AM, Modeller Caretaker wrote:
> > On 02/21/2012 08:55 AM, Xiao-Ping Zhang wrote:
> >> When I was refining loops for a model in parallel, I asked modeller to
> >> generate 9999 models. But modeller stopped at 1245 with the following
> >> error. The computer has six cores, modeller uses four of them. The other
> >> two cores were free.
> > ...
> >> ValueError: Ran out of slaves to run tasks
> >
> > This means exactly what it says: all of the slaves died, so it had
> > nowhere to run the loop model building tasks. Each slave generates its
> > own output file (look for files ending in .slave). Look in there to see
> > what the problem was with each slave.
>
> To conclude: it turned out that each slave was running out of memory.
> This is actually caused by a memory leak in Modeller that only affects
> parallel loopmodel runs. A patch is available to fix the problem at
> http://salilab.org/modeller/wiki/Patches
>
> Ben Webb, Modeller Caretaker
Hi,
I got the following error when trying to run an old Python script (see below) that used to work very well. A couple of days ago, I removed the Java that came with the system (Fedora 13) and installed Java from java.com (jdk-7u4-linux-i586.rpm, jre-7u4-linux-i586.rpm) to fix a crash in Jalview.
I wonder whether the problem is related to the new Java or to something else.
Thank you.
XP
############ my script ########################
# Homology modeling by the automodel class
from modeller import *              # Load standard Modeller classes
from modeller.automodel import *    # Load the automodel class
from modeller.parallel import *     # Load the parallel class,
                                    # to use multiple processors

# Use 5 CPUs in a parallel job on this machine
j = job()                           # Cluster
j.append(local_slave())
j.append(local_slave())
j.append(local_slave())
j.append(local_slave())
j.append(local_slave())

log.verbose()                       # request verbose output
env = environ()                     # create a new MODELLER environment to build this model in
env.io.hetatm = True

# directories for input atom files
env.io.atom_files_directory = ['.', '/usr/share/doc/modeller-9v8/examples']

a = automodel(env,
              alnfile = 'CC1295N.ali',              # alignment filename
              knowns = ('1H6L', '1CVM1', '3AMR'),   # codes of the templates
              sequence = 'CC1295N')                 # code of the target
a.starting_model= 1                 # index of the first model
a.ending_model = 20                 # index of the last model
                                    # (determines how many models to calculate)
a.use_parallel_job(j)
a.make()                            # do the actual homology modeling
#################### Errors ########################
Traceback (most recent call last):
  File "CC1295N_model.py", line 32, in ?
    a.make() # do the actual homologymodeling
  File "/usr/lib/modeller9.10/modlib/modeller/automodel/automodel.py", line 107, in make
    self.multiple_models(atmsel)
  File "/usr/lib/modeller9.10/modlib/modeller/automodel/automodel.py", line 208, in multiple_models
    self.parallel_multiple_models(atmsel)
  File "/usr/lib/modeller9.10/modlib/modeller/automodel/automodel.py", line 228, in parallel_multiple_models
    self.outputs.extend(job.run_all_tasks())
  File "/usr/lib/modeller9.10/modlib/modeller/parallel/job.py", line 131, in run_all_tasks
    for task in self._finish_all_tasks():
  File "/usr/lib/modeller9.10/modlib/modeller/parallel/job.py", line 164, in _finish_all_tasks
    task = self._process_event(obj, s)
  File "/usr/lib/modeller9.10/modlib/modeller/parallel/job.py", line 180, in _process_event
    task = obj.task_results()
  File "/usr/lib/modeller9.10/modlib/modeller/parallel/slave.py", line 61, in task_results
    r = self.get_data(allow_heartbeat=True)
  File "/usr/lib/modeller9.10/modlib/modeller/parallel/communicator.py", line 89, in get_data
    (cmdtype, obj) = self._recv()
  File "/usr/lib/modeller9.10/modlib/modeller/parallel/communicator.py", line 130, in _recv
    raise RemoteError(obj.exc, self)
modeller.parallel.communicator.RemoteError: IndexError: user_form__E> Functional form 8195 out of range from <Slave on localhost>
#####################################################
On 05/15/2012 05:59 PM, Xiao-Ping Zhang wrote:
> I got the following error when trying to run a old python script (see
> below) that worked very well before. A couple of days ago, I removed the
> Java came from the system (Fedora 13) and installed Java from java.com
> (jdk-7u4-linux-i586.rpm, jre-7u4-linux-i586.rpm) to fix a crash problem
> of Jalview.
>
> I wonder if the problem is related to the new Java or some other
> problems.
Modeller doesn't use Java, so I doubt this is the cause of your problem.
> modeller.parallel.communicator.RemoteError: IndexError: user_form__E>
> Functional form 8195 out of range from <Slave on localhost>
OK, so the master is telling you that one of the slaves encountered a problem. You can look in the .slave output files to find which slave it was; there may be some more information in there. The error suggests that your restraints file is corrupted, perhaps because of a filesystem problem such as the disk filling up or a quota being exceeded.
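A quick way to check the disk-full hypothesis is sketched below using only the Python standard library. It reports how much of the filesystem holding the working directory is free and lists any zero-length restraints files. The function names (free_fraction, empty_files) are made up for this example, and the .rsr suffix is an assumption about how the restraints file is named in your run. User quotas are not visible through shutil.disk_usage, so they need a separate check with your system's quota tools.

```python
import os
import shutil


def free_fraction(path="."):
    """Fraction (0.0 to 1.0) of the filesystem containing `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def empty_files(directory=".", suffix=".rsr"):
    """Return sorted paths in `directory` ending in `suffix` with size 0.

    Assumption: restraints files end in ".rsr"; pass another suffix if not.
    """
    found = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name.endswith(suffix) and os.path.isfile(path):
            if os.path.getsize(path) == 0:
                found.append(path)
    return found


if __name__ == "__main__":
    print("free space: %.1f%%" % (100 * free_fraction(".")))
    for path in empty_files("."):
        print("empty restraints file:", path)
```

A nearly full disk or an empty restraints file would support the explanation above. A non-empty file can still be truncated or corrupted, so deleting it and letting the script regenerate it is a reasonable next step.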
Ben Webb, Modeller Caretaker