Corresponding parent ticket: [[!tails_ticket 5734]]
* Some pieces of our infrastructure are critical to e.g.:
- the development process (if the ISO build fails, developers
- the release process -- which may block us from putting out
emergency security fixes
- users (if the APT repository is down, the "additional software
packages" persistence feature is broken)
* We want to avoid contributors getting used to ignoring alerts sent by
our CI system. The more false positives there are, the more they
will "learn" to do so. Here we want to reduce the rate of false
positives caused by malfunctioning infrastructure.
* We want to shorten the dev/feedback loop for sysadmins when they
deploy changes, and also when changes are automatically applied
(e.g. Puppet agent passes, or automatic APT upgrades).
* We want to be notified when a service we run doesn't come back up
properly post-reboot, without having to manually test every service.
* We want to minimize the rate of non-sysadmins discovering and
reporting problems _first_, that is, before we learn about them.
This is highly subjective, but replying "we're aware of this problem
and are working on it" is much more confidence inspiring than
"really, it's broken?"
Here, we call:
* _machine_: a computer (be it bare metal or virtual) and its
* _monitored machine_: a machine we monitor
* _monitoring machine_: the machine(s) that monitors the... _monitored
* _monitoring system_, or _monitoring setup_: all the software
components that we run so that the monitoring machine can monitor
the monitored ones, and their configuration
Note that the monitoring machine may very well, at the same time,
itself be monitored (be it by itself, or by another monitoring
## Human interface
The monitoring system:
* MUST send email notifications to the sysadmin(s) in charge, to lower
* MUST offer an overview of the status of our systems, via a web
interface that works within Tor Browser with the security slider set
* MAY additionally offer a read-only version of this overview, that we
MAY want to make available to selected contributors, or anonymous
users. Needless to say, this must be carefully balanced with the
security implications of such a system (in other words, a set of
exported static HTML pages is totally fine, but a huge dynamic web
application is probably a non-starter).
* MUST support configuring, with a per-check/per-service granularity,
a threshold of N failures _in a row_ before an alert is raised.
Still, it SHOULD support triggering alerts depending on the
frequency of such failures, even when they never fail twice in a row
(we don't want to miss the fact that `$service` is down for
5 minutes every day). Implementation details may vary, but you get
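
The two alerting rules above (N failures in a row, and too many failures over a period even when they are never consecutive) can be sketched as follows. This is only an illustration of the requirement, not the behaviour of any particular monitoring system; the class name, parameter names and thresholds are all hypothetical.

```python
from collections import deque


class AlertPolicy:
    """Decide whether a check's recent results warrant an alert.

    Hypothetical helper: it only illustrates the two rules, it is not
    taken from any existing monitoring system.
    """

    def __init__(self, consecutive_threshold, window_threshold, window):
        # Alert after this many failures in a row.
        self.consecutive_threshold = consecutive_threshold
        # Alert when at least this many of the last `window` results failed.
        self.window_threshold = window_threshold
        self.consecutive = 0
        self.history = deque(maxlen=window)

    def record(self, ok):
        """Record one check result; return True if an alert should be raised."""
        self.history.append(ok)
        self.consecutive = 0 if ok else self.consecutive + 1
        failures = sum(1 for result in self.history if not result)
        return (self.consecutive >= self.consecutive_threshold
                or failures >= self.window_threshold)
```

With `AlertPolicy(consecutive_threshold=3, window_threshold=4, window=10)`, a service that fails every other run never reaches three failures in a row, but still triggers an alert on its fourth failure within the last ten results.
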
## Threat model
### Compromised monitored machine
* We do not try to avoid the fact that it can report wrong information
(this includes missing information) about itself.
* It MUST NOT result in a compromise of the monitoring machine.
* It MUST NOT be able to DoS the sysadmin(s) in charge, e.g.
by flooding them with alerts.
* It MUST NOT result in a compromise of the network traffic between
other monitored machines and the monitoring machine (e.g. if that
traffic is encrypted, the monitored machines MUST NOT use the same
* It SHOULD NOT be able to alter the information about other
### Compromised monitoring machine
* We do not try to avoid the fact that it can DoS the sysadmin(s) in
charge, e.g. by flooding them with alerts.
* We do not try to avoid the fact that it can report wrong information
about the monitored machines.
* It MUST NOT be able to run arbitrary code as root on any of the
* It SHOULD NOT be able to run arbitrary code as a non-privileged user
on any of the monitored machines.
### Network attacker
Here, we consider an attacker that may be active or passive, and can
sit at any point they choose on the Internet.
We accept the risk that a network attacker:
* can enumerate the machines and services we monitor;
* can view the reports, test results, and any such information about
monitored services, that the monitoring system needs to learn; this
of course implies that we should be careful about what kind of
information flows this way: it MUST NOT be a big deal if it leaks
into the hands of an adversary;
* can DoS our monitoring, e.g. by blocking network connections;
* can spoof the reports, test results and the like about monitored
services that a client has no credible means to authenticate.
However, a network attacker:
* SHOULD NOT be able to spoof the reports, test results and the like
that monitored machines send about themselves;
* MUST NOT be able to run arbitrary code on the monitored machines;
* MUST NOT be able to run arbitrary code on the monitoring machine.
## Availability, sustainability
Here, we assume that the entire monitoring system has both software
components that run on the monitored machines (that we call the
"agent"), and software components that run on the monitoring machine
(that we call the "server"). Below, the _agent_ implicitly includes
anything needed for basic usage (plugins, checks, whatever); and
similarly, the _server_ implicitly includes its web interface, and
anything needed for basic usage (plugins, checks, etc.).
* The agent MUST be usually available in all of Debian oldstable,
stable, and testing -- possibly thanks to _pre-existing_ and
well-maintained official backports. All these versions of the agent
MUST be compatible with the chosen version of the server.
* The server MUST be usually available either in current Debian stable
(Jessie), or in current Debian testing (Stretch). We are considering
running the version from Debian testing mainly because it might
avoid having to go through a costly upgrade process in a couple
years, e.g. to switch to the next major, incompatible version of
* Both the agent and the server MUST be actively maintained in all the
versions of Debian we care about (see above). Hint: this excludes
* Both the agent and the server MUST be DFSG-free.
* For all involved software, the upstream project MUST be mature and
active. It MUST have a confidence inspiring future. We can't afford
having to migrate to a totally different monitoring setup in three
years, to the extent that this can be foreseen. Hint: given Nagios 4
is not an option (see above), this in turn excludes all older
versions of Nagios.
* It SHOULD be realistically possible for external contributors to
have patches merged into the upstream codebase of the
* All the involved software MUST have a not-too-scary security
Here, we have two major desires. One is the ability for humans to
easily review the monitoring system's configuration, or changes
proposed to it, so that contributions are made easier. The other is
the ability to include monitoring aspects within the description of
the services we run, in a self-contained way, so that describing them
in Puppet is easier. Note that a system that satisfies the second
requirement has a good chance of also mostly satisfying the first one as
The chosen monitoring system:
* SHOULD allow encoding, in the description of a service (read: in the
corresponding Puppet class), how it needs to be monitored.
- Additionally, if this optional (but warmly welcome) requirement is
satisfied, then the "shared Puppet modules" we use SHOULD already
support the chosen monitoring system (hint: in practice, this
means something compatible with Nagios).
- Note: this gives us for free the ability to review the monitoring
configuration for service checks, but it is unrelated to our
ability to review the global configuration of the server
components, that run on the monitoring machine.
* SHOULD allow humans to easily review the service checks
configuration. Really, that's a *strong* SHOULD. A system that
doesn't make this possible will need to have very serious advantages
in other areas to be attractive to us.
* SHOULD allow humans to review the global configuration of the server
components, that run on the monitoring machine. This assumes that
said configuration is mostly static, and is unaffected when adding
or modifying service checks.
## Adequacy to our resources
Being able to operate the monitoring system for 20-50 monitored
systems MUST NOT require Tails sysadmins to invest lots of time and
become experts at hand-holding a complex software stack: the main
focus of our system and automation engineers shall not become
monitoring. For example, we won't like a monitoring system that is
trivial to set up for monitoring 5-10 hosts, but requires adding more
and more moving parts and complex optional components to be able to
scale up to 50 hosts.
* We run Tor hidden services that we want to monitor, so the
monitoring system MUST allow using a configured SOCKS proxy for
specific checks (worst case, for _all_ checks, but it prevents us
from). Wrapping checks with `torsocks` might be an acceptable
option, depending on how involved and hackish this would be. The
ability to retry, and to not notify on the first error, is valuable here.
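
As a rough idea of how involved the `torsocks` wrapping option might be, here is a minimal sketch that prefixes a check command with `torsocks` and retries it before reporting a failure. The function names and the retry count are hypothetical, and a real monitoring system would more likely use its own retry settings than a wrapper like this.

```python
import subprocess


def torsocks_command(check_command):
    """Prefix a check command so that its connections go through Tor."""
    return ["torsocks"] + list(check_command)


def run_with_retries(run_check, attempts=3):
    """Call `run_check` up to `attempts` times; succeed on the first pass."""
    for _ in range(attempts):
        if run_check():
            return True
    return False


def check_via_tor(check_command, attempts=3):
    """Run a torsocks-wrapped check, retrying before reporting a failure.

    Assumes `torsocks` is installed and a Tor daemon is running locally.
    """
    def run_once():
        result = subprocess.run(
            torsocks_command(check_command),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0

    return run_with_retries(run_once, attempts)
```

Retrying inside the check hides transient Tor circuit failures from the notification logic, at the cost of making a single check run slower when the service really is down.
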
## Hosting of the monitoring machine
* The monitoring machine MUST be a virtual machine.
* We MUST be able to administer the OS of the monitoring machine
ourselves: we need to be root, we need to have a Puppet agent that
talks to our own puppetmaster, we want to do the initial
* The monitoring machine MUST be hosted on infrastructure managed by
people the Tails sysadmins trust quite a bit.
* The people who manage the underlying hardware and infrastructure
MUST be reactive and easy to get in touch with.
* We MUST be given out-of-band access to the monitoring machine.
* The monitoring machine MUST have unfiltered access to the Internet,
and SHOULD be assigned at least one public IPv4 address.
* Hosting MUST be affordable (say, max. 20€/month).
* The monitoring machine SHOULD allow at least some flexibility
regarding future "hardware" upgrades (e.g. allocating more disk
space, memory, CPU cores).
* TODO: exact hardware specifications, depending on the chosen
monitoring system. Let's keep in mind that collecting exported
Puppet resources is expensive.
# Service and system checks
Below, HIGH, MEDIUM and LOW are priority levels wrt. the implementation
of such checks.
For description of individual services, see
## All systems
* HIGH: up and running!
* HIGH: disk space usage (bytes and inodes)
* HIGH: memory usage
* MEDIUM: Puppet agent last run
* MEDIUM: APT indices (aka. `apt-get update` was successfully run recently)
* MEDIUM: `systemctl is-system-running` (see [[!tails_ticket 8262]])
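
To illustrate why the disk space check covers both bytes and inodes, here is a minimal sketch based on `os.statvfs`. Most monitoring systems ship an equivalent check already; the function names and the default threshold below are hypothetical.

```python
import os


def percent_used(total, available):
    """Return the percentage used, or 0.0 when the total is unknown."""
    if total == 0:
        # Some filesystems do not report an inode count.
        return 0.0
    return 100.0 * (total - available) / total


def disk_usage_percent(path="/"):
    """Return (percent of bytes used, percent of inodes used) for `path`."""
    stats = os.statvfs(path)
    # f_bavail and f_favail count what is available to unprivileged users.
    bytes_used = percent_used(stats.f_blocks, stats.f_bavail)
    inodes_used = percent_used(stats.f_files, stats.f_favail)
    return bytes_used, inodes_used


def disk_problems(path="/", threshold=90.0):
    """List which of the two resources exceed `threshold` percent."""
    bytes_used, inodes_used = disk_usage_percent(path)
    problems = []
    if bytes_used >= threshold:
        problems.append("bytes")
    if inodes_used >= threshold:
        problems.append("inodes")
    return problems
```

A filesystem holding many small files can run out of inodes while plenty of bytes remain free, which is why the two figures are checked separately.
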
## APT repository
* CRITICAL: `stable` APT suite over HTTP
* CRITICAL: freezable APT repository, once it exists
* MEDIUM: compare `getblockcount` with what the Internet says it
should be (probably requires exporting the output of `bitcoin-cli
getblockcount` to a place that's readable by the monitoring agent)
* LOW: last Tails release is seeded
* MEDIUM: `git pull` or `git clone` a test repository over all
supported protocols (currently: `git://` and SSH)
* HIGH: our Tor Browser archive must be reachable over HTTP, and
contain directories with tarballs
* CRITICAL: the HTTP server must be up, and unauthenticated connections
must be rejected (may require installing its TLS certificate, or
skipping certificate validation, or something)
## Nightly builds
* CRITICAL: <http://nightly.tails.boum.org/> must have directories for
the `stable` and `devel` branches, that contain ISO images
* CRITICAL: check, over `rsync://`, that expected directories are there
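
The "contain ISO images" part of the HTTP check could be implemented by parsing a directory index page, as in the sketch below. It assumes the web server exposes a plain HTML listing in which ISO images are linked directly, which this document does not confirm; fetching the page is left out, and the names used are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def iso_links(index_html):
    """Return the links in a directory index that point at ISO images."""
    collector = LinkCollector()
    collector.feed(index_html)
    return [link for link in collector.links if link.lower().endswith(".iso")]
```

A check would fetch the index page of each branch directory and report a failure when `iso_links` returns an empty list.
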
## Test suite infrastructure
* HIGH: the (fake or limited) SSH and SFTP access used by core
contributors and robots when running the test suite must be up
* CRITICAL: <https://tails.boum.org/> must be up and working
## WhisperBack relay
* HIGH: SMTP server is up
* MEDIUM: email is actually relayed (would be truly good to have, but
hard to implement, so the cost/benefit ratio is likely to be pretty
## XMPP server
* MEDIUM: responds on the TCP/IP port it is listening on
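
A "responds on the TCP/IP port" check boils down to attempting a TCP connection, as in this minimal sketch. The function name and default timeout are hypothetical, and it only shows that something accepts connections, not that it speaks XMPP.

```python
import socket


def tcp_port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For example, `tcp_port_open("xmpp.example.org", 5222)` would test the standard XMPP client port on a placeholder hostname.
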