FAQ
===

Why do I get the warning ``Too many open files`` in my log files?
-----------------------------------------------------------------

This happens when your WS exceeds its open file limit. This limit is set by the OS in ``/etc/security/limits.conf``. You can raise the limit, but hitting it is usually a sign that your WS is overloaded and its network is saturated. In this case, increasing the open file limit will only make things worse.

When your WS assigns a WU to a client it is committing network bandwidth both to upload the WU to the client and to later download the WU results. If the WS overcommits its bandwidth, WU transfers will begin to backlog. More and more clients will open connections and transfers will become slower and slower.

The solution is to reduce the traffic to your WS to a level it can handle. This can be done by decreasing its weight on the AS.

Future versions (> 9.2.x) of the WS will allow you to specify the available network bandwidth, and the WS will automatically coordinate with the AS to not assign too many WUs.

Why are my WUs not getting credited?
------------------------------------

If you have verified that your WUs are being returned to your WS or CS but are not being credited then either:

#. The WS is failing to write the credit log.
#. The stats system is not collecting your credit log.

In the first case the WS should automatically shut itself down when it fails to write the credit log. If this is occurring it will be apparent in the WS log files. Look for error messages.

If the credit log is being written, a number of possible problems could prevent your credit logs from being collected:

a. Your firewall is not allowing ssh connections from foldingathome.org.
b. Your ssh server is not allowing key based ssh authentication.
c. The foldingathome.org public key is not allowed to ssh to your WS.
d. Your WS is not listed in the stats system's server list.

See the following sections of the install guide:

- `Open and redirect ports `__
- `Enable key based ssh login `__
- `Have your WS added to the stats system `__

To ensure that your credits are being collected, check the credit log file ``CPUtimeinfo.log``. It grows as the WS writes to it. If you have missing credits you may be able to find them in this log file.

Periodically the stats system should ssh into your WS and download the credit log. When it does, it moves the collected credit log to the ``logs`` directory and appends the date and time. Look in the ``logs`` directory for collected credit logs and check their time stamps.

How do I monitor the jobs available, jobs assigned, error reports, etc.?
------------------------------------------------------------------------

Visit the WS's admin page in your browser here:

::

    http:///admin

The machine you are on must be granted access to view this page by adding its IP address or IP address range to ``web-allow`` in ``config.xml``.

If you visit the admin pages over https you will need to add a browser security exception to tell your browser to accept the SSL certificates generated by the F@H system.

Do I need to run both a Work Server (WS) and Collection Server (CS)?
--------------------------------------------------------------------

A CS acts as a backup in case your WS goes down, so yes, you do need to run a separate WS and CS on different machines. However, if you have two WS they can act as each other's CS. Simply add the CS configuration variables (``collect`` & ``collect-allow``) to your WS and restart.

Can I change WU points in the middle of a run?
----------------------------------------------

Yes, points can be changed by editing the project configuration and restarting the WS.

Why won't my WS start?
----------------------

The first thing to do is check the log file by running:

::

    /etc/init.d/fah-work -name

Where ```` is the name of your WS. Look for errors that may tell you why the WS failed.

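When reading through a long WS log, it can help to pull out only the lines that mention an error. The following is a minimal sketch, not part of the WS tooling: the helper name ``recent_errors`` is made up for illustration, the log path is whatever your installation uses, and matching on the word "error" is an assumption about the log format.

```shell
#!/bin/sh
# Hypothetical helper: print the last few lines of a log file that
# mention "error" (case-insensitive), prefixed with their line numbers.
# Assumption: WS error lines contain the word "error" in some casing.
recent_errors() {
    log_file=$1          # path to the log file to scan
    max_lines=${2:-20}   # how many matching lines to show (default 20)
    grep -in 'error' "$log_file" | tail -n "$max_lines"
}
```

For example, ``recent_errors /path/to/your/ws.log 50`` would show the 50 most recent matching lines, where the path is a placeholder for your actual log location.
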
If the above does not help you resolve the problem, try running the WS manually as follows:

::

    cd ~server/server2
    fah-work

This should run the WS in the foreground. Any errors should be immediately apparent. If the WS does start you can hit CTRL-C to make it exit. Again, look for errors, attempt to correct them and try again.

Why does job creation fail?
---------------------------

There are a number of reasons why job creation can fail. Always check the log file for error messages. The most common reasons follow:

WS cannot find needed programs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The WS runs as a system script so it does not always have the same ``PATH`` as the user account you use to set up a project. This causes the job creation and next gen commands to fail. The solution is to add the path to the file ``/etc/default/fah-work.`` where ```` is the name your WS was deployed as.

For example, if your WS is named example and the tools you need were installed in ``/usr/local/gromacs-v5.0.4/bin``, you would edit the file ``/etc/default/fah-work.example`` and add:

::

    export PATH=/usr/local/gromacs-v5.0.4/bin:$PATH

Then stop and restart the WS like this:

::

    /etc/init.d/fah-work -n example stop
    /etc/init.d/fah-work -n example start

Note that it is important to do a full stop and start rather than a restart so that the WS loads the new ``PATH``.

Wrong arguments to creation command
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Double check the arguments you pass to your creation command. Test them out manually.

Missing files
~~~~~~~~~~~~~

Double check that all the files are available and accessible. You can see the exact paths the WS is using in the log files.

Missing line continuation in command
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The WS will interpret multiple lines in the project command options as multiple commands unless a backslash is used to continue the line. Failing to add the backslash will cause the command to fail.

For example:

::

    gmx grompp \
      -f $home/mdp/debug.mdp \
      -c $home/runs/$run.gro \
      -p $home/runs/$run.top \
      -o $jobdir/frame0.tpr \
      -po $jobdir/mdout.mdp \
      -maxwarn 2

WS unable to connect to CS
--------------------------

Check the following:

- The WS has been registered with the AS so that it has a valid certificate.
- The CS is up and reachable over HTTPS on port 8084.
- The CS is reachable from the WS, e.g. pingable.
- The CS collector-for option includes the WS's IP or IP range.
- The WS collection-servers option contains the correct CS IP address.
- The WS Web interface for its current CS connection status.
- The CS log for errors.
- The WS log for errors.

How do I restart CLONEs stopped due to errors?
----------------------------------------------

The following commands will restart the WS and tell it to reset the error counters of all jobs.

::

    /etc/init.d/fah-work -n stop
    /etc/init.d/fah-work -n start -- --clear-job-errors

Wait until the WS has fully restarted and reloaded all its jobs, which you can see in the Web interface. Then restart the WS again:

::

    /etc/init.d/fah-work -n restart

The second WS restart removes the ``clear-job-errors`` option. Otherwise, if an error later caused the WS to restart, this option would again reset the error counts.

In v9.1+ you can clear the errors of a single job by visiting the Web interface, finding the job under the ``Jobs`` tab and clicking the ``Reset job`` button under the ``Actions`` column on the far right.
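
The stop, start with ``--clear-job-errors``, and final restart sequence for resetting all job error counters can be wrapped in a small helper so that the last restart is not forgotten. This is only a sketch under stated assumptions: the names ``clear_job_errors``, ``run``, ``WS_INIT`` and ``DRY_RUN`` are invented for illustration, the init script path is the one used elsewhere in this FAQ, and the pause is manual because the reload has to be confirmed in the Web interface.

```shell
#!/bin/sh
# Hypothetical wrapper around the error-counter reset sequence.
# WS_INIT: init script path (assumed to be the one used in this FAQ).
# DRY_RUN=1: only print the commands instead of running them.
WS_INIT=${WS_INIT:-/etc/init.d/fah-work}

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

clear_job_errors() {
    name=$1   # the name your WS was deployed as
    if [ -z "$name" ]; then
        echo "usage: clear_job_errors NAME" >&2
        return 2
    fi
    run "$WS_INIT" -n "$name" stop
    run "$WS_INIT" -n "$name" start -- --clear-job-errors
    if [ "${DRY_RUN:-0}" != "1" ]; then
        # The reload cannot be detected here; the operator confirms it.
        printf 'Press Enter once the Web interface shows all jobs reloaded: '
        read -r reply
    fi
    # Second restart drops the --clear-job-errors option again.
    run "$WS_INIT" -n "$name" restart
}
```

Running ``DRY_RUN=1 clear_job_errors example`` prints the three commands without touching the WS, which is a cheap way to check the sequence before using it for real.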