FAQ
===

Why do I get the warning ``Too many open files`` in my log files?
-----------------------------------------------------------------

This happens when your Work Server (WS) exceeds its open file limit. This limit
is set by the OS in ``/etc/security/limits.conf``. You can raise the limit, but
hitting it is usually a sign that your WS is overloaded and its network is
saturated. In that case, increasing the open file limit will only make things
worse.

When your WS assigns a WU to a client it is committing network bandwidth to both
upload the WU to the client and later download the WU results. If the WS
overcommits its bandwidth, WU transfers will begin to backlog. More and more
clients will open connections and transfers will become slower and slower.

The solution is to reduce the traffic to your WS to a level it can handle.
This can be done by decreasing its weight on the AS. Future versions (> 9.2.x)
of the WS will allow you to specify the available network bandwidth, and the WS
will automatically coordinate with the AS so that it is not assigned too many WUs.
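
To see how close a running WS is to its limit, you can compare its open file
descriptors with its soft limit. The following is a minimal sketch, assuming a
Linux host with ``/proc`` mounted. The function name ``fd_usage`` is
hypothetical and not part of the WS; you must look up the WS process ID
yourself, for example with ``ps``.

```shell
# fd_usage <pid>: print "<open descriptors> <soft limit>" for a process.
# Assumes Linux with /proc; fd_usage is an illustrative name, not a WS tool.
fd_usage() {
    pid="$1"
    # Each entry in /proc/<pid>/fd is one open descriptor.
    used=$(ls "/proc/$pid/fd" | wc -l)
    # In /proc/<pid>/limits the soft limit is the 4th field of "Max open files".
    limit=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")
    echo "$used $limit"
}

# Example: inspect the current shell itself.
fd_usage $$
```

Reading the descriptors of a process owned by another user requires root, so
run this as root or as the user the WS runs as. If the first number is close to
the second, the WS is near its limit.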

Why are my WUs not getting credited?
------------------------------------

If you have verified that your WUs are being returned to your WS or CS but are
not being credited then either:

#. The WS is failing to write the credit log.
#. The stats system is not collecting your credit log.

In the first case, the WS should automatically shut itself down when it fails
to write the credit log. If this is occurring, it will be apparent in the WS
log files. Look for error messages.
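
One quick way to scan a log for such messages is sketched below. It assumes
that error lines contain the word "error" in some letter case, which may not
hold for every message the WS writes. The function name ``recent_errors`` is
hypothetical, and you need to supply the path to your own WS log file.

```shell
# recent_errors <logfile> [count]: show the last lines that mention "error".
# The word "error" as a marker is an assumption about the log format.
recent_errors() {
    # -i ignores case, -n prefixes each match with its line number.
    grep -i -n 'error' "$1" | tail -n "${2:-20}"
}
```

Calling ``recent_errors <log file> 50`` would print up to the last 50 matching
lines, each prefixed with its line number in the log.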

If the credit log is being written, any of the following problems could prevent
your credit logs from being collected:

a. Your firewall does not allow ssh connections from foldingathome.org.
b. Your ssh server does not allow key-based ssh authentication.
c. The foldingathome.org public key is not allowed to ssh to your WS.
d. Your WS is not listed in the stats system's server list.

See the following sections of the install guide:

-  `Open and redirect
   ports <https://github.com/FoldingAtHome/fah-work/blob/master/INSTALL.md#open-and-redirect-ports>`__
-  `Enable key based ssh
   login <https://github.com/FoldingAtHome/fah-work/blob/master/INSTALL.md#enable-key-based-ssh-login>`__
-  `Have your WS added to the stats
   system <https://github.com/FoldingAtHome/fah-work/blob/master/INSTALL.md#have-your-ws-added-to-the-stats-system>`__
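
As a first check for problem (b), you can look for an explicit
``PubkeyAuthentication no`` line in your ssh server configuration. This is only
a sketch: the function name ``pubkey_auth_disabled`` is hypothetical, the
configuration path is an assumption that varies by system, and the check
ignores ``Match`` blocks and included files.

```shell
# pubkey_auth_disabled <sshd_config>: succeed if the file explicitly turns off
# key-based login. Illustrative helper; it does not evaluate Match blocks.
pubkey_auth_disabled() {
    # Match an uncommented "PubkeyAuthentication no" line, ignoring letter case.
    grep -E -i -q '^[[:space:]]*PubkeyAuthentication[[:space:]]+no([[:space:]]|$)' "$1"
}
```

For example, ``pubkey_auth_disabled /etc/ssh/sshd_config && echo "key-based
login is disabled"``. Running ``sshd -T`` as root prints the effective
configuration and is the more reliable check.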

To ensure that your credits are being collected, check the credit log file
``CPUtimeinfo.log``. As the WS writes to it, it will grow. If you have missing
credits you may be able to find them in this log file. Periodically the stats
system should ssh into your WS and download the credit log. When it does, it
moves the collected credit log to the ``logs`` directory and appends the date
and time to its name. Look in the ``logs`` directory for collected credit logs
and check their time stamps.
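
The check on the ``logs`` directory can be sketched as follows. The glob
``CPUtimeinfo*`` is an assumption about how collected logs are named, since
only the appended date and time are described above. The function name
``newest_credit_log`` is hypothetical.

```shell
# newest_credit_log <ws-dir>: print the most recently modified collected
# credit log under <ws-dir>/logs. The CPUtimeinfo* pattern is an assumption.
newest_credit_log() {
    # ls -t sorts by modification time, newest first; keep only the first entry.
    ls -t "$1"/logs/CPUtimeinfo* 2>/dev/null | head -n 1
}
```

Passing the result to ``ls -l`` shows when the stats system last collected a
log. If nothing is printed, no collected logs were found in that directory.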

How do I monitor the jobs available, jobs assigned, error reports, etc.?
------------------------------------------------------------------------

Visit the WS's admin page in your browser here:

::

    http://<ws host name>/admin

The machine you are on must be granted access to view this page by adding its
IP address or IP address range to ``web-allow`` in ``config.xml``.

If you visit the admin pages over HTTPS you will need to add a browser security
exception to tell your browser to accept the SSL certificates generated by the
F@H system.

Do I need to run both a Work Server (WS) and Collection Server (CS)?
--------------------------------------------------------------------

A CS acts as a backup in case your WS goes down, so yes, you do need to run a
separate WS and CS on different machines. However, if you have two WSs they can
act as each other's CS. Simply add the CS configuration variables
(``collect`` & ``collect-allow``) to your WS and restart.

Can I change WU points in the middle of a run?
----------------------------------------------

Yes, points can be changed by editing the project configuration and
restarting the WS.

Why won't my WS start?
----------------------

The first thing to do is check the log file by running:

::

    /etc/init.d/fah-work -name <name>

Where ``<name>`` is the name of your WS. Look for errors that may tell
you why the WS failed.

If the above does not help you resolve the problem try running the WS
manually as follows:

::

    cd ~server/server2
    fah-work

This should run the WS in the foreground. Any errors should be immediately
apparent. If the WS does start you can hit CTRL-C to make it exit. Again,
look for errors, attempt to correct them and try again.

Why does job creation fail?
---------------------------

There are a number of reasons why job creation can fail. Always check the log
file for error messages. The most common reasons follow:

WS cannot find needed programs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The WS runs as a system script so it does not always have the same ``PATH`` as
the user account you use to set up a project. This causes the job creation and
next gen commands to fail. The solution is to add the path to the file
``/etc/default/fah-work.<name>`` where ``<name>`` is the name your WS was
deployed as. For example, if your WS is named ``example`` and the tools you
need were installed in ``/usr/local/gromacs-v5.0.4/bin``, you would edit the
file ``/etc/default/fah-work.example`` and add:

::

    export PATH=/usr/local/gromacs-v5.0.4/bin:$PATH

Then stop and restart the WS like this:

::

    /etc/init.d/fah-work -n example stop
    /etc/init.d/fah-work -n example start

Note that it is important to do a full stop and start rather than a restart so
that the WS loads the new ``PATH``.
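
Before restarting, you can check that the edited file really makes the tool
visible. This sketch assumes the defaults file is plain shell that can be
sourced, as the ``export`` line above suggests. It only approximates the
environment the WS runs in, and the function name ``check_tool`` is
hypothetical.

```shell
# check_tool <defaults-file> <command>: print where <command> resolves once the
# defaults file has been sourced; fails if the command is not found.
check_tool() {
    # The subshell keeps the sourced PATH from leaking into the calling shell.
    ( . "$1" && command -v "$2" )
}
```

For the example above, ``check_tool /etc/default/fah-work.example gmx`` should
print a path under ``/usr/local/gromacs-v5.0.4/bin`` if the edit took effect.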

Wrong arguments to creation command
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Double check the arguments you pass to your creation command. Test them out
manually.

Missing files
~~~~~~~~~~~~~

Double check that all the files are available and accessible. You can
see the exact paths the WS is using in the log files.

Missing line continuation in command
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The WS will interpret multiple lines in the project command options as multiple
commands unless a backslash is used to continue the line. Failing to add the
backslash will cause the command to fail. For example:

::

    <create-command>
      gmx grompp \
        -f $home/mdp/debug.mdp \
        -c $home/runs/$run.gro \
        -p $home/runs/$run.top \
        -o $jobdir/frame0.tpr \
        -po $jobdir/mdout.mdp \
        -maxwarn 2
    </create-command>

WS unable to connect to CS
--------------------------

Check the following:

- The WS has been registered with the AS so that it has a valid certificate.
- The CS is up and reachable over HTTPS on port 8084.
- The CS is reachable from the WS, e.g. pingable.
- The CS collector-for option includes the WS's IP or IP range.
- The WS collection-servers option contains the correct CS IP address.
- The WS Web interface for its current CS connection status.
- The CS log for errors.
- The WS log for errors.
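
The reachability items above can be tested from the WS with a plain TCP
connection attempt. This is a coarse sketch: it only shows that something
accepts connections on the port, not that the certificates are valid. It
assumes ``bash`` and the coreutils ``timeout`` command are installed, and the
function name ``port_open`` is hypothetical.

```shell
# port_open <host> <port>: succeed if a TCP connection opens within 5 seconds.
# Relies on bash's /dev/tcp redirection and coreutils timeout (assumed present).
port_open() {
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}
```

For example, run ``port_open <cs host name> 8084 && echo reachable`` on the WS.
A failure points to a firewall, routing, or CS problem rather than a
configuration option.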

How do I restart CLONEs stopped due to errors?
----------------------------------------------

The following commands will restart the WS and tell it to reset the error
counters of all jobs.

::

    /etc/init.d/fah-work -n <name> stop
    /etc/init.d/fah-work -n <name> start -- --clear-job-errors

Wait until the WS has fully restarted and reloaded all its jobs, which you
can see in the Web interface. Then restart the WS again:

::

    /etc/init.d/fah-work -n <name> restart

The second WS restart removes the ``clear-job-errors`` option. Otherwise, if
an error later caused the WS to restart, this option would reset the error
counts again.

In v9.1+ you can clear the errors of a single job by visiting the Web interface,
finding the job under the ``Jobs`` tab and clicking the ``Reset job`` button
under the ``Actions`` column on the far right.