+Corresponding parent ticket: [[!tails_ticket 5734]]
+[[!toc levels=3]]
+# Introduction
+## Why?
+* Some pieces of our infrastructure are critical to e.g.:
+ - the development process (if the ISO build fails, developers
+ cannot work)
+ - the release process -- which may block us from putting out
+ emergency security fixes
+ - users (if the APT repository is down, the "additional software
+ packages" persistence feature is broken)
+* We want to avoid contributors getting used to ignoring alerts sent by
+ our CI system. The more false positives there are, the more they
+ will "learn" to do so. Here we want to diminish the rate of false
+ positives caused by malfunctioning infrastructure.
+* We want to shorten the dev/feedback loop for sysadmins when they
+ deploy changes, and also when changes are automatically applied
+ (e.g. Puppet agent passes, or automatic APT upgrades).
+* We want to be notified when a service we run doesn't come back up
+ properly post-reboot, without having to manually test every service.
+* We want to minimize the rate of non-sysadmins discovering and
+ reporting problems _first_, that is before we learn about them.
+ This is highly subjective, but replying "we're aware of this problem
+ and are working on it" is much more confidence-inspiring than
+ "really, it's broken?"
+## Nomenclature
+Here, we call:
+* _machine_: a computer (be it bare metal or virtual) and its
+ operating system
+* _monitored machine_: a machine we monitor
+* _monitoring machine_: the machine(s) that monitors the... _monitored
+ machines_
+* _monitoring system_, or _monitoring setup_: all the software
+ components that we run so that the monitoring machine can monitor
+ the monitored ones, and their configuration
+Note that the monitoring machine may very well, at the same time,
+itself be monitored (be it by itself, or by another monitoring
+machine).
+## Human interface
+The monitoring system:
+* MUST send email notifications to the sysadmin(s) in charge, to lower
+ the downtime.
+* MUST offer an overview of the status of our systems, via a web
+ interface that works within Tor Browser with the security slider set
+ to Medium-High.
+* MAY additionally offer a read-only version of this overview, that we
+ may want to make available to selected contributors, or anonymous
+ users. Needless to say, this must be carefully balanced with the
+ security implications of such a system (in other words, a set of
+ exported static HTML pages is totally fine, but a huge dynamic web
+ application is probably a non-starter).
+* MUST support configuring, with a per-check/per-service granularity,
+ a threshold of N failures _in a row_ before an alert is raised.
+ Still, it SHOULD support triggering alerts depending on the
+ frequency of such failures, even when they never fail twice in a row
+ (we don't want to miss the fact that `$service` is down for
+ 5 minutes every day). Implementation details may vary, but you get
+ the idea.
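The two alerting rules above (N failures in a row, plus a frequency-based rule) can be sketched as follows. This is only an illustration, not the behavior of any particular monitoring system: the `AlertPolicy` class, its parameter names, and the sliding-window approach are all assumptions made for the example.

```python
from collections import deque


class AlertPolicy:
    """Hypothetical per-check alerting policy (illustration only)."""

    def __init__(self, max_consecutive, window_limit, window_seconds):
        self.max_consecutive = max_consecutive  # alert after N failures in a row
        self.window_limit = window_limit        # ...or this many failures per window
        self.window_seconds = window_seconds    # length of the sliding window
        self.consecutive = 0
        self.failure_times = deque()

    def record(self, ok, now):
        """Record one check result at time `now` (in seconds).

        Returns True when an alert should be raised.
        """
        if ok:
            self.consecutive = 0
        else:
            self.consecutive += 1
            self.failure_times.append(now)
        # Forget failures that have fallen out of the sliding window.
        while self.failure_times and now - self.failure_times[0] > self.window_seconds:
            self.failure_times.popleft()
        return (self.consecutive >= self.max_consecutive
                or len(self.failure_times) >= self.window_limit)
```

With such a policy, a service that fails once a day but always recovers before the next check still triggers an alert once enough failures accumulate within the window.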
+## Threat model
+### Compromised monitored machine
+* We do not try to avoid the fact that it can report wrong information
+ (this includes missing information) about itself.
+* It MUST NOT result in a compromise of the monitoring machine.
+* It MUST NOT be able to DoS the sysadmin(s) in charge, e.g.
+ by flooding them with alerts.
+* It MUST NOT result in a compromise of the network traffic between
+ other monitored machines and the monitoring machine (e.g. if that
+ traffic is encrypted, the monitored machines MUST NOT use the same
+ private key).
+* It SHOULD NOT be able to alter the information about other
+ monitored machines.
+### Compromised monitoring machine
+* We do not try to avoid the fact that it can DoS the sysadmin(s) in
+ charge, e.g. by flooding them with alerts.
+* We do not try to avoid the fact that it can report wrong information
+ about the monitored machines.
+* It MUST NOT be able to run arbitrary code as root on any of the
+ monitored machines.
+* It SHOULD NOT be able to run arbitrary code as a non-privileged user
+ on any of the monitored machines.
+### Network attacker
+Here, we consider an attacker that may be active or passive, and can
+sit at any point they choose on the Internet.
+We accept the risk that a network attacker:
+* can enumerate the machines and services we monitor;
+* can view the reports, test results, and any such information about
+ monitored services, that the monitoring system needs to learn; this
+ of course implies that we should be careful about what kind of
+ information flows this way: it MUST NOT be a big deal if it leaks
+ into the hands of an adversary;
+* can DoS our monitoring, e.g. by blocking network connections;
+* can spoof the reports, test results and the like about monitored
+ services that a client has no credible means to authenticate.
+However, a network attacker:
+* SHOULD NOT be able to spoof the reports, test results and the like
+ that monitored machines send about themselves;
+* MUST NOT be able to run arbitrary code on the monitored machines;
+* MUST NOT be able to run arbitrary code on the monitoring machine.
+## Availability, sustainability
+Here, we assume that the entire monitoring system has both software
+components that run on the monitored machines (that we call the
+"agent"), and software components that run on the monitoring machine
+(that we call the "server"). Below, the _agent_ implicitly includes
+anything needed for basic usage (plugins, checks, whatever); and
+similarly, the _server_ implicitly includes its web interface, and
+anything needed for basic usage (plugins, checks, etc.).
+* The agent MUST be usually available in all of Debian oldstable,
+ stable, and testing -- possibly thanks to _pre-existing_ and
+ well-maintained official backports. All these versions of the agent
+ MUST be compatible with the chosen version of the server.
+* The server MUST be usually available either in current Debian stable
+ (Jessie), or in current Debian testing (Stretch). We are considering
+ running the version from Debian testing mainly because it might
+ avoid having to go through a costly upgrade process in a couple
+ years, e.g. to switch to the next major, incompatible version of
+ the software.
+* Both the agent and the server MUST be actively maintained in all the
+ versions of Debian we care about (see above). Hint: this excludes
+ Nagios 4.
+* Both the agent and the server MUST be DFSG-free.
+* For all involved software, the upstream project MUST be mature and
+ active. It MUST have a confidence-inspiring future. We can't afford
+ having to migrate to a totally different monitoring setup in three
+ years, to the extent that this can be foreseen. Hint: given Nagios 4
+ is not an option (see above), this in turn excludes all older
+ versions of Nagios.
+* It SHOULD be realistically possible for external contributors to
+ have patches merged into the upstream codebase of the
+ involved software.
+* All the involved software MUST have a not-too-scary security
+ track record.
+## Configuration
+Here, we have two major desires. One is the ability for humans to
+easily review the monitoring system's configuration, or changes
+proposed to it, so that contributions are made easier. The other is
+the ability to include monitoring aspects within the description of
+the services we run, in a self-contained way, so that describing them
+in Puppet is easier. Note that a system that satisfies the second
+requirement has a good chance of also mostly satisfying the first one as
+well.
+The chosen monitoring system:
+* SHOULD allow encoding, in the description of a service (read: in the
+ corresponding Puppet class), how it needs to be monitored.
+ - Additionally, if this optional (but warmly welcome) requirement is
+ satisfied, then the "shared Puppet modules" we use SHOULD already
+ support the chosen monitoring system (hint: in practice, this
+ means something compatible with Nagios).
+ - Note: this gives us for free the ability to review the monitoring
+ configuration for service checks, but it is unrelated to our
+ ability to review the global configuration of the server
+ components, that run on the monitoring machine.
+* SHOULD allow humans to easily review the service checks
+ configuration. Really, that's a *strong* SHOULD. A system that
+ doesn't make this possible will need to have very serious advantages
+ in other areas to be attractive to us.
+* SHOULD allow humans to review the global configuration of the server
+ components, that run on the monitoring machine. This assumes that
+ said configuration is mostly static, and is unaffected when adding
+ or modifying service checks.
+## Adequacy to our resources
+Being able to operate the monitoring system for 20-50 monitored
+systems MUST NOT require Tails sysadmins to invest lots of time and
+become experts at hand-holding a complex software stack: the main
+focus of our system and automation engineers shall not become
+monitoring. For example, we won't like a monitoring system that is
+trivial to set up for monitoring 5-10 hosts, but requires adding more
+and more moving parts and complex optional components to be able to
+scale up to 50 hosts.
+## Miscellaneous
+* We run Tor hidden services, that we want to monitor, so the
+ monitoring system MUST allow using a configured SOCKS proxy for
+ specific checks (worst case, for _all_ checks, but it prevents us
+ from). Wrapping checks with `torsocks` might be an acceptable
+ option, depending on how involved and hackish this would be. Ability
+ to retry and not notify on first error is interesting here.
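As a rough sketch of the `torsocks` wrapping and retry idea mentioned above: the helper names `wrap_with_torsocks` and `run_check`, the default of three attempts, and the 120-second timeout are assumptions made for this example, not features of any given monitoring system.

```python
import subprocess


def wrap_with_torsocks(command, via_tor):
    """Prefix a check command with `torsocks` when it must go through Tor."""
    return ["torsocks", *command] if via_tor else list(command)


def run_check(command, attempts=3, runner=subprocess.run):
    """Run a check up to `attempts` times; return True on the first success.

    `runner` defaults to subprocess.run and can be replaced for testing.
    """
    for _ in range(attempts):
        try:
            result = runner(command, capture_output=True, timeout=120)
        except subprocess.TimeoutExpired:
            continue  # Tor circuits can be slow: count a timeout as one failed try
        if result.returncode == 0:
            return True
    return False
```

A check for a hidden service would then be run as `run_check(wrap_with_torsocks(some_check_command, via_tor=True))`, where `some_check_command` is whatever check plugin invocation the chosen monitoring system uses.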
+## Hosting of the monitoring machine
+* The monitoring machine MUST be a virtual machine.
+* We MUST be able to administer the OS of the monitoring machine
+ ourselves: we need to be root, we need to have a Puppet agent that
+ talks to our own puppetmaster, we want to do the initial
+ OS installation.
+* The monitoring machine MUST be hosted on infrastructure managed by
+ people the Tails sysadmins trust quite a bit.
+* The people who manage the underlying hardware and infrastructure
+ MUST be reactive and easy to get in touch with.
+* We MUST be given out-of-band access to the monitoring machine.
+* The monitoring machine MUST have unfiltered access to the Internet,
+ and SHOULD be assigned at least one public IPv4 address.
+* Hosting MUST be affordable (say, max. 20€/month).
+* The monitoring machine SHOULD allow at least some flexibility
+ regarding future "hardware" upgrades (e.g. allocating more disk
+ space, memory, CPU cores).
+* TODO: exact hardware specifications, depending on the chosen
+ monitoring system. Let's keep in mind that collecting exported
+ Puppet resources is expensive.
+<a id="services"></a>
+# Service and system checks
+Below, CRITICAL, HIGH, MEDIUM and LOW are priority levels wrt. the
+implementation of such checks.
+For description of individual services, see
+## All systems
+* HIGH: up and running!
+* HIGH: disk space usage (bytes and inodes)
+* HIGH: memory usage
+* MEDIUM: Puppet agent last run
+* MEDIUM: APT indices (aka. `apt-get update` was successfully run recently)
+* MEDIUM: `systemctl is-system-running` (see [[!tails_ticket 8262]])
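To make the "disk space usage (bytes and inodes)" check concrete, here is a minimal sketch based on `os.statvfs`. The function names and the 90% threshold are assumptions chosen for illustration; an actual deployment would more likely rely on the check plugins shipped with the chosen monitoring system.

```python
import os


def used_percent(total, available):
    """Percentage of a resource in use; 0.0 when the total is not reported."""
    if total <= 0:
        return 0.0
    return 100.0 * (total - available) / total


def filesystem_usage(path="/"):
    """Return (bytes_used_percent, inodes_used_percent) for `path`."""
    stats = os.statvfs(path)
    # f_bavail and f_favail count what is available to unprivileged users.
    return (used_percent(stats.f_blocks, stats.f_bavail),
            used_percent(stats.f_files, stats.f_favail))


def is_healthy(path="/", max_percent=90.0):
    """True when both block and inode usage are below `max_percent`."""
    return all(value < max_percent for value in filesystem_usage(path))
```

Checking inodes separately matters because a filesystem can run out of inodes while still reporting plenty of free bytes.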
+## APT repository
+* CRITICAL: `stable` APT suite over HTTP
+* CRITICAL: freezable APT repository, once it exists
+## Bitcoind
+* MEDIUM: compare `getblockcount` with what the Internet says it
+ should be (probably requires exporting the output of `bitcoin-cli
+ getblockcount` to a place that's readable by the monitoring agent)
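The comparison described above could look roughly like this. The helper names and the tolerance of 3 blocks are assumptions, and how the reference height ("what the Internet says") is obtained is deliberately left out, since the source does not specify it.

```python
import subprocess


def local_block_count(runner=subprocess.run):
    """Return the block height reported by the local `bitcoin-cli getblockcount`."""
    result = runner(["bitcoin-cli", "getblockcount"],
                    capture_output=True, text=True, check=True)
    return int(result.stdout.strip())


def is_in_sync(local_height, reference_height, tolerance=3):
    """True when the local node is at most `tolerance` blocks behind the reference."""
    return reference_height - local_height <= tolerance
```

A small tolerance avoids alerting whenever the reference source simply learns about a new block a little earlier than our node does.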
+## BitTorrent
+* LOW: last Tails release is seeded
+## Gitolite
+* MEDIUM: `git pull` or `git clone` a test repository over all
+ supported protocols (currently: `git://` and SSH)
+## git-annex
+* HIGH: our Tor Browser archive must be reachable over HTTP, and
+ contain directories with tarballs
+## Jenkins
+* CRITICAL: the HTTP server must be up, and unauthenticated connections
+ must be forbidden (may require installing its TLS certificate, or
+ skipping certificate validation, or something)
+## Nightly builds
+* CRITICAL: <> must have directories for
+ the `stable` and `devel` branches, that contain ISO images
+## rsync
+* CRITICAL: check, over `rsync://`, that expected directories are there
+## Test suite infrastructure
+* HIGH: the (fake or limited) SSH and SFTP access used by core
+ contributors and robots when running the test suite must be up
+## Website
+* CRITICAL: <> must be up and working
+## WhisperBack relay
+* HIGH: SMTP server is up
+* MEDIUM: email is actually relayed (would be truly good to have, but
+ hard to implement, so the cost/benefit ratio is likely to be pretty
+ bad)
+## XMPP server
+* MEDIUM: responds on the TCP/IP port it is listening on