FAQ
Why do I get the warning Too many open files in my log files?
This happens when your WS exceeds its open file limit. This limit is set by the OS in /etc/security/limits.conf. You can raise the limit, but hitting it is usually a sign that your WS is overloaded and its network is saturated. In that case, increasing the open file limit will only make things worse.
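To see how close the WS is to the limit, you can compare its current file descriptor usage against the configured limit. A minimal sketch, assuming the WS runs under the server account used elsewhere in this guide:

# Show the open file limit for the server account
su - server -c 'ulimit -n'

# Count file descriptors currently held by the WS process
ls /proc/$(pgrep -f fah-work | head -n 1)/fd | wc -l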
When your WS assigns a WU to a client, it is committing network bandwidth both to upload the WU to the client and later to download the WU results. If the WS overcommits its bandwidth, WU transfers will begin to backlog. More and more clients will open connections and transfers will become slower and slower.
The solution is to reduce the traffic to your WS to a level it can handle. This can be done by decreasing its weight on the AS. Future versions (> 9.2.x) of the WS will allow you to specify the available network bandwidth, and the WS will automatically coordinate with the AS so that too many WUs are not assigned.
Why are my WUs not getting credited?
If you have verified that your WUs are being returned to your WS or CS but they are not being credited, then either:
The WS is failing to write the credit log.
The stats system is not collecting your credit log.
In the first case, the WS should automatically shut itself down when it fails to write the credit log. If this is occurring, it will be apparent in the WS log files. Look for error messages.
If the credit log is being written, a number of possible problems could prevent your credit logs from being collected:
Your firewall is not allowing ssh connections from foldingathome.org.
Your ssh server is not allowing key-based ssh authentication.
The foldingathome.org public key is not allowed to ssh to your WS.
Your WS is not listed in the stats system’s server list.
See the related sections of the install guide.
To ensure that your credits are being collected, check the credit log file CPUtimeinfo.log. As the WS writes to it, it will grow. If you have missing credits, you may be able to find them in this log file. Periodically the stats system should ssh into your WS and download the credit log. When it does, it moves the collected credit log to the logs directory and appends the date and time. Look in the logs directory for collected credit logs and check the time stamps.
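For example, you can check both logs from a shell on the WS. The paths below are a sketch assuming the default server directory used elsewhere in this guide:

# Check that the active credit log exists and is growing
ls -l ~server/server2/CPUtimeinfo.log

# List collected credit logs, newest first, to see when the stats system last connected
ls -lt ~server/server2/logs/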
How do I monitor the jobs available, jobs assigned, error reports, etc.?
Visit the WS’s admin page in your browser here:
http://<ws host name>/admin
The machine you are on must be granted access to view this page by adding its IP address or IP address range to web-allow in config.xml.
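For example, an entry like the following in config.xml should grant access to the local machine and a private subnet. This is only a sketch: the v= attribute form and the addresses are assumptions, so check your existing config.xml for the exact syntax it uses:

<web-allow v="127.0.0.1 192.168.1.0/24"/>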
If you visit the admin pages over HTTPS, you will need to add a browser security exception to tell your browser to accept the SSL certificates generated by the F@H system.
Do I need to run both a Work Server (WS) and Collection Server (CS)?
A CS acts as a backup in case your WS goes down, so yes, you do need to run separate WS and CS on different machines. However, if you have two WSs they can act as each other's CS. Simply add the CS configuration variables (collect & collect-allow) to your WS and restart.
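As a rough sketch, the added configuration on one of the two machines might look like the following. The attribute form and the boolean value are assumptions, so consult the install guide for the exact syntax of these options:

<!-- Act as a CS and accept collected WUs from the partner WS -->
<collect v="true"/>
<collect-allow v="<partner ws ip>"/>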
Can I change WU points in the middle of a run?
Yes, points can be changed by editing the project configuration and restarting the WS.
Why won't my WS start?
The first thing to do is check the log file by running:
/etc/init.d/fah-work -name <name>
Where <name> is the name of your WS. Look for errors that may tell you why the WS failed.
If the above does not help you resolve the problem try running the WS manually as follows:
cd ~server/server2
fah-work
This should run the WS in the foreground. Any errors should be immediately apparent. If the WS does start, you can hit CTRL-C to make it exit. Again, look for errors, attempt to correct them, and try again.
Why does job creation fail?
There are a number of reasons why job creation can fail. Always check the log file for error messages. The most common reasons follow:
WS cannot find needed programs
The WS runs as a system script so it does not always have the same PATH as the user account you use to set up a project. This causes the job creation and next gen commands to fail. The solution is to add the path to the file /etc/default/fah-work.<name>, where <name> is the name your WS was deployed as. For example, if your WS is named example and the tools you need were installed in /usr/local/gromacs-v5.0.4/bin, you would edit the file /etc/default/fah-work.example and add:
export PATH=/usr/local/gromacs-v5.0.4/bin:$PATH
Then stop and restart the WS like this:
/etc/init.d/fah-work -n example stop
/etc/init.d/fah-work -n example start
Note it is important that you do a full stop and start rather than a restart so that the WS loads the new PATH.
Wrong arguments to creation command
Double check the arguments you pass to your creation command. Test them out manually.
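For example, if your create command is the gmx grompp invocation shown later in this FAQ, run it by hand as the setup user with the WS variables ($home, $run and $jobdir) substituted. The paths below are illustrative only:

# $home -> the project directory, $run -> a run number, $jobdir -> a scratch directory
gmx grompp \
    -f /home/server/projects/p1234/mdp/debug.mdp \
    -c /home/server/projects/p1234/runs/0.gro \
    -p /home/server/projects/p1234/runs/0.top \
    -o /tmp/frame0.tpr \
    -po /tmp/mdout.mdp \
    -maxwarn 2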
Missing files
Double check that all the files are available and accessible. You can see the exact paths the WS is using in the log files.
Missing line continuation in command
The WS will interpret multiple lines in the project command options as multiple commands unless a backslash is used to continue the line. Failing to add the backslash will cause the command to fail. For example:
<create-command>
gmx grompp \
-f $home/mdp/debug.mdp \
-c $home/runs/$run.gro \
-p $home/runs/$run.top \
-o $jobdir/frame0.tpr \
-po $jobdir/mdout.mdp \
-maxwarn 2
</create-command>
WS unable to connect to CS
Check the following:
The WS has been registered with the AS so that it has a valid certificate.
The CS is up and reachable over HTTPS on port 8084.
The CS is reachable from the WS, e.g. pingable (see the sketch after this list).
The CS collector-for option includes the WS's IP or IP range.
The WS collection-servers option contains the correct CS IP address.
The WS Web interface for its current CS connection status.
The CS log for errors.
The WS log for errors.
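For the reachability checks above, something like the following can be run from the WS. This is a sketch; curl's -k flag skips certificate validation because the certificates generated by the F@H system will not be in your system's trust store:

# Basic reachability
ping -c 3 <cs host>

# Confirm the CS answers HTTPS on port 8084
curl -vk https://<cs host>:8084/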
How do I restart CLONEs stopped due to errors?
The following commands will restart the WS and tell it to reset the error counters of all jobs.
/etc/init.d/fah-work -n <name> stop
/etc/init.d/fah-work -n <name> start -- --clear-job-errors
Wait until the WS has fully restarted and reloaded all its jobs, which you can see in the Web interface. Then restart the WS again:
/etc/init.d/fah-work -n <name> restart
The second WS restart removes the clear-job-errors option. Otherwise, if an error later caused the WS to restart, this option would again reset the error counts.
In v9.1+ you can clear the errors of a single job by visiting the Web interface, finding the job under the Jobs tab and clicking the Reset job button under the Actions column on the far right.