summaryrefslogtreecommitdiffstats
path: root/wiki/src/blueprint/HTTP_mirror_pool.mdwn
blob: 36d25161e6728e97d16b3b4d96967512f2b9f3b7 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
**Ticket**: [[!tails_ticket 7161]]

The idea I had was to let the server(s) send a reduced list of hosts. Not
only it would allow to work-around Tor DNS limitations, but also to have
some weighted round robin, in order to prioritize some high bandwidth
mirrors, if we choose to.

If I had to mention the ideal design goals for such changes, I would say
that the more straightforward would be the better for implementation and
also for maintainability.

[[!toc levels=3]]

# The plan

## Big picture

We decided to implement a two-way strategy for this feature:

* Use JavaScript to modify the link on the download page, so that each
  user is pointed to random mirror.
  - Vanilla JS (no frameworks)
  - Store the JS code and its configuration file in two dedicated
    ikiwiki overlays, for finer-grained access control possibilities
    to it (e.g. we may want to let people who don't have commit access
    to Git maintain the mirrors pool).
  - Configuration for the JS is loaded from JSON file. It attaches
    a weight to each mirror. Weight 0 means that the mirror is
    currently disabled, and will never be redirected to. We're using
    JSON and not YAML to avoid the need to use a third-party parser.

* Keep using DNS to point to 3-5 fast and reliable mirrors. This will
  be the fallback for people who do not use JS. So we still need a DNS
  dynamic update system; we can simply re-purpose the one we already
  have (`dl.amnesia.boum.org`).

* The ISO verification extension also needs to use mirrors. So, we'll
  be providing library code for it to do the same, internally, as
  a web browser would do when visiting the Tails download page, that
  is replacing the hostname, in a ISO download URL, with a suitable
  mirror's hostname (using the JSON mirror pool configuration file).

# Initial research

## Using DNS

Using DNS seems to be an easy way to do some round robin in low level. It
allows some kind of transparency to the upper layers protocols and
distribute the load and to avoid having a single server that can became a
SPOF.

The following ways are available to implement it:

* CNAME Hacks
* NS Hacks
* Modified DNS servers

### CNAME Hacks

As mention by ToBeFree something that can be done is to have different
pools of servers like:

	a.dl.amnesia.boum.org A $MIRROR1
	a.dl.amnesia.boum.org A $MIRROR2
	a.dl.amnesia.boum.org A $MIRROR3
	a.dl.amnesia.boum.org A $MIRROR4
	a.dl.amnesia.boum.org A $MIRROR5

	b.dl.amnesia.boum.org A $MIRROR6
	b.dl.amnesia.boum.org A $MIRROR7

	dl.amnedia.boum.org CNAME a.dl.amnesia.boum.org
	dl.amnedia.boum.org CNAME b.dl.amnesia.boum.org

Interestingly the requests would be equally distributed betwen a.dl and
b.dl, thus if their is more mirrors in one name than one other some
servers would be somehow prioritized. For example: here "a" mirrors will
share 50% of requests, giving 10% for every host where "b" mirrors will
share the other 50% of requests betwen two host giving them 25% of
requests each.

However this kind of CNAME hack is not supported by current DNS
Servers. Bind 8 used to support it with a [configuration
option](http://docstore.mik.ua/orelly/networking_2ndEd/dns/ch10_07.htm)
that has not been ported to bind 9. Neither NSD nor PowerDNS seem to
support it, and their is no actual data about how resolvers would
handle this case, so I don't think it is the best option.

### NS Hacks

Following the same idea the dl amnesia.boum.org could be delegated to a
few different DNS servers, and those servers may have different versions
of the zone. For example:

	dl.amnesia.boum.org NS $DNS1
	dl.amnesia.boum.org NS $DNS2
	
DNS1 would have a zone similar to a.dl.amnesia.boum.org:

	dl.amnesia.boum.org A $MIRROR1
	dl.amnesia.boum.org A $MIRROR2
	dl.amnesia.boum.org A $MIRROR3
	dl.amnesia.boum.org A $MIRROR4
	dl.amnesia.boum.org A $MIRROR5
	
And DNS2 would have a zone similar to b.dl.amnesia.boum.org:

	dl.amnesia.boum.org A $MIRROR6
	dl.amnesia.boum.org A $MIRROR7

In theory it should work (and give almost the same load distribution as
CNAME hacks, almost as the NS servers will not receive 50% of requests
because of [[!rfc 5452]]). However, I am not sure that playing with DNS
inconsistency will be a so good idea, for example for maintainability :)

### Using modified DNS servers

Interestingly Tails is not the first project to be looking how to use DNS
for load distribution. People already wrote some DNS software designed to
handle those usecases and return the visitor a reduced list of servers
according to some rules like weights or geolocalisation. They work by
delegating a subzone (like dl.amnesia.boum.org) to those servers and with
zone files containing additionals fields. There is two main softwares for
those usecases:

* <http://gdnsd.org/> which is available on debian and used for
  example for wikipedia.

* <https://github.com/abh/geodns> that requires manual installation
  and is used for example by pool.ntp.org.

Deploying such software would solve the problem in a more elegant way than
CNAME or NS hacks. It would require a bit of system administration that
maybe can be done using some puppet templates in a few Virtal Machines.

## Using HTTP(s)

DNS is not the only way to do some load balancing. It is mostly used for
low level protocols that don't allow redirects (for example: NTP). As
content download is already done using HTTP(s). HTTP(s) can be leveraged
to do this kind of load balancing. It is what is done by sourceforge.net
as pointed by ToBeFree and by many Linux distributions.

For example using a PHP script (or more complete options such as
mirrorbrain, thanks Sagi!) that would redirect requests to
`dl.amnesia.boum.org/$file to $mirror.dl.amnesia.boum.org/$file`
randomly or according to some additional rules (weights,
geolocalisation, SSL availability ...).

There is a few drawbacks on this approach:

* It would increase a bit the load on boum.org's server.
* It would increase the dependency on this server, meaning that it is
  unavailable (down, blocked...), downloads will be blocked (but in this
  case the site will be too).
* It would require to develop the script or to install one such as
  mirrorbrain (which is feature-full, available as [3rd party Debian
  packages](http://download.opensuse.org/repositories/Apache:/MirrorBrain/),
  and requires PostgreSQL).

On the other side it has a few advantage:

* It will only require a few ~20 lines of PHP script when DNS based
  solutions require to install and maintain additional software and servers.
* It can allow the script to be personnalised to add some additional rules
  if necessary.
* As boum.org server will see every requests, it would allow to do some
  stats.
* It can allow to use $mirror.dl.amnesia.boum.org URLs, allowing to deploy
  SSL certificates easier that if all mirrors use dl.amnesia.boum.org.
* As ToBeFree mentionned (thanks!) it is also possible to use some client
  side scripts to select the mirror. I would not recommend to rely only on
  this option as it would not work for browsers without scripts, but it can
  be done as a complementary approach, it order to reduce the load and
  dependency to dl.amnesia.boum.org's server.

Thus, if I may, I would like to recommend considering the HTTP(s) option,
even if it means that I have to write the PHP script by myself or to create
an easy task entry on the ticket tracker and follow it :)

## Proof of concept: JavaScript + multiple DNS pools / named mirrors

This method can either be used with multiple DNS pools (dl1.amnesia.boum.org, dl2.amnesia.boum.org etc.) or with named mirrors (freiwuppertal.dl.amnesia.boum.org, othermirror.dl.amnesia.boum.org, ...). Using named mirrors allows you to use a huge, unlimited list of completely equally used mirrors; using multiple DNS pools leads to effects described under "CNAME hacks".

These POCs should be 1:1 usable on the Tails [[download]] page. All that would be needed is setting up the DNS pools and/or named mirrors, and telling the mirror owners to configure their servers to respond to \*.amnesia.boum.org (the wildcard is important).

### JavaScript POC (multiple DNS pools)
   
    <script src="//code.jquery.com/jquery.min.js"></script>
    <script type="text/javascript">//<![CDATA[ 
    $(window).load(function(){
        var hosts = Array("dl.amnesia.boum.org","dl2.amnesia.boum.org","dl3.amnesia.boum.org","dl4.amnesia.boum.org","dl5.amnesia.boum.org");
        var host = hosts[Math.floor(Math.random()*hosts.length)];
    $(document).ready(function() {
        var strNewString = $('body').html().replace(/dl\.amnesia\.boum\.org/g,host);
        $('body').html(strNewString);
    });
    });//]]>  
    </script>

For this to work and to be flexible, mirrors need to respond to \*.amnesia.boum.org. Just responding to one of the pool names would make this a very unflexible solution, so the wildcard is needed. 

At least nginx is unable to use a wildcard like dl\*.amnesia.boum.org, so \*.amnesia.boum.org has to be used. This is more flexible anyway.

#### Example webpage (see the webpage source there too)

<http://freiwuppertal.de/tails-mirror-example-dns.htm>


### JavaScript POC (named mirrors)
    
    <script src="//code.jquery.com/jquery.min.js"></script>
    <script type="text/javascript">//<![CDATA[ 
    $(window).load(function(){
        var hosts = Array("freiwuppertal.dl.amnesia.boum.org","othermirror.dl.amnesia.boum.org","yetanother.dl.amnesia.boum.org","weirdname.dl.amnesia.boum.org","supermirror.dl.amnesia.boum.org");
        var host = hosts[Math.floor(Math.random()*hosts.length)];
    $(document).ready(function() {
        var strNewString = $('body').html().replace(/dl\.amnesia\.boum\.org/g,host);
        $('body').html(strNewString);
    });
    });//]]>  
    </script>

For this to work and to be flexible, mirrors need to respond to \*.amnesia.boum.org. Just responding to a fixed name would make this an unflexible solution, so the wildcard is needed.

#### Example webpage (see the webpage source there too)

<http://freiwuppertal.de/tails-mirror-example-named.htm>

#### Giving mirrors higher or lower weight

Using this approach, giving one mirror more weight than others is very easy: Simply add it's name multiple times to the array of mirrors. :D

### Vanilla JavaScript POC and JSON

        <a href="http://dl.amnesia.boum.org/tails/stable/tails-i386-1.6/tails-i386-1.6.iso" id="dllink">download link</a>

        <script type="text/javascript">
        function fetchJSONdata(path, callback) {
            var xhr = new XMLHttpRequest();
            xhr.onreadystatechange = function() {
                if (xhr.readyState === 4) {
                    if (xhr.status === 200 || xhr.status === 0) {
                        var data = JSON.parse(xhr.responseText);
                        if (callback) callback(data);
                    } else {
						console.log( "Error: " + xhr.statusText);
					}
				}
			};
            xhr.open('GET', path, true);
            xhr.send();
        }

		function getRandomInt(min, max) {
  			return Math.floor(Math.random() * (max - min +1)) + min;
		}

		function isJSON(str) {
			try {
				JSON.parse(str);
			} catch (e) {
				return false;
			}
			return true;
		}

		function replaceDownloadURL(updatedURL) {
			var URLMarker = "/tails/stable";
			// todo check that url is a correct url
			var linkDOMElem = document.getElementById('dllink');
			var linkHREF = linkDOMElem.href.split( '//' );
			var linkToISO = linkHREF[1].split( URLMarker );
			// fixme http or https
			linkDOMElem.href = '//' + updatedURL + URLMarker + linkToISO[1];
			return true;
		}

        fetchJSONdata('./mirrors.json', function(data){
            //console.log(data);
			if( data == "undefined" ) {
				console.log( "Error: mirror data not loaded.");
			} else if( !isJSON( JSON.stringify(data) ) ) {
				console.log( "Error: mirror data is not JSON.");
			} else {
				//console.log(data.mirrors);
				// todo delete all mirrors with weight 0 before choosing one
				if(data.mirrors.length > 0 ) {
					var activeMirrors = new Array();
					for ( i = 0; i < data.mirrors.length; i++ ) {
						if ( data.mirrors[i].weight != 0 ) {
							// add mirror as many times as its weight, max weight is 5
							if ( parseInt(data.mirrors[i].weight ) > 5) {
								var max_weight = 5;
							} else {
								var max_weight = parseInt( data.mirrors[i].weight );
							}
							for ( w = 0; w < max_weight; w++ ) {
								activeMirrors.push( data.mirrors[i] );
							}
						}
					}
					console.log(activeMirrors);

					var randomMirror = getRandomInt(0, activeMirrors.length-1);
					//console.log(randomMirror);
					//console.log(data.mirrors[randomMirror]);
					replaceDownloadURL(activeMirrors[randomMirror].url);
				}
			}
        });
    </script>

The mirrors.json file contains:
<pre>
   {
	"mirrors": [
		{ "url": "1.dl.amnesia.boum.org", "weight": "10" },
		{ "url": "5.dl.amnesia.boum.org", "weight": "5" },
		{ "url": "6.dl.amnesia.boum.org", "weight": "6" },
		{ "url": "3.dl.amnesia.boum.org", "weight": "0" }
	]
    }
</pre>


## PHP: first draft


    // http://stackoverflow.com/questions/4233407/get-random-item-from-array
    
    $mirrors = Array("alice.amnesia.boum.org","bob.amnesia.boum.org","clark.amnesia.boum.org","deborah.amnesia.boum.org","eric.amnesia.boum.org","freiwuppertal.amnesia.boum.org");
    $mirror = $mirrors[array_rand($mirrors)];
    echo "<p><a href=\"http://{$mirror}/tails/stable/tails-i386-1.4/tails-i386-1.4.iso\">Download Tails!</a></p>\n";
    echo "<p>Selected mirror: {$mirror}</p>";

Try it here:
http://sandbox.onlinephpfunctions.com/code/54ffcc18e5dbbafc6c7d3c81e0c26f94ce7946fc

Note: I am a horrible coder and basically copied this from the linked StackOverflow page. This page also helped me: http://php.net/manual/de/function.echo.php
...and that's all. There might be security flaws in this extremely simple concept, so please have a close look at it. :)