A problem facing internet users worldwide is that of bandwidth availability. Bandwidth costs money: the more one needs, the more it costs. Local bandwidth, which connects systems within the same country, is generally very cheap, so high-bandwidth local connections can be obtained at fairly low prices. International bandwidth, however, is very costly, because telecoms companies do not have exclusive control over international links and do not have as much freedom to upgrade these links to provide more bandwidth. Because of this, dedicated lines with guaranteed international bandwidth are extremely expensive. Even lines with non-guaranteed international bandwidth (e.g. Telkom’s ADSL), while not as expensive, have restrictions: ADSL subscriber traffic is monitored, and subscribers are limited to 3GB of international traffic per month. Once this limit has been reached, they are transferred to a slower international line, which is shared with all other subscribers who have reached their limit, effectively making international access impossible.
What should be fairly obvious is that a large amount of international traffic is duplicated: thousands of South African internet users would probably download the same video card drivers, for example, from the manufacturer’s international website. If the driver download is 10MB, and 1000 users download that file, then 10GB of international bandwidth has been used for one 10MB file. Figures are even larger when Microsoft releases service packs, which are typically greater than 100MB. Since most corporate networks and private users update their systems with the latest service packs, we have tens of thousands of downloads of a file that is 100-200MB in size; even 10,000 downloads of a 100MB file amounts to roughly 1TB of international traffic. The amount of duplicated bandwidth is phenomenal.
What I propose to research and implement is a “distributed proxy network” to conserve international bandwidth. A service will be installed on any machine that is to be connected to the proxy network via a central proxy server, or hub. Users making use of the network will direct their browsers to use the service as a gateway to the internet.
When a user running the service wants to download a file from an international site, his browser will first check with the service to see if that exact file has been downloaded already. To accomplish this, the service will retrieve the header from the international site, and ask the hub if a matching file has already been downloaded by a user on the network.
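The lookup described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the proposal does not say which header fields identify a file, so the choice of Content-Length, Last-Modified and ETag is a guess, and the names `fingerprint` and `Hub` are hypothetical. The hub is reduced to an in-memory index so that the example runs without a network.

```python
import hashlib


def fingerprint(url, headers):
    """Derive a lookup key from a URL and the headers the site returned for it."""
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    # Assumption: these three fields are enough to tell file versions apart.
    fields = ("content-length", "last-modified", "etag")
    parts = [url] + [lowered.get(field, "") for field in fields]
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


class Hub:
    """In-memory stand-in for the hub's index: key -> hosts holding the file."""

    def __init__(self):
        self._index = {}

    def register(self, key, host):
        # Called once a host has finished downloading the file.
        self._index.setdefault(key, set()).add(host)

    def lookup(self, key):
        # An empty list means no local copy exists, so the service would
        # fall back to the international site.
        return sorted(self._index.get(key, ()))


hub = Hub()
headers = {"Content-Length": "10485760", "ETag": '"abc123"'}
key = fingerprint("http://example.com/driver.exe", headers)
hub.register(key, "host-a")
print(hub.lookup(key))
```

Hashing the URL together with the headers means a new release at the same address (different length or ETag) produces a different key, so a stale local copy would not be served in its place.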
It is important to note that this system is not a peer-to-peer network: users will not be able to browse other users’ machines and retrieve the files they want directly. Files will only be accessible via a redirected download request for a file from an international site.
Of course, this method presents a few problems; possible solutions are presented afterwards
Possible solutions to the above problems include
The system will also perform some form of intelligent load balancing: when the hub generates its list of hosts, it will query each host to find out how much bandwidth that host has available, and will then instruct the user to download portions of the file based on this information. For example, if four users are hosting a 100MB file, it may not be feasible to simply download 25MB from each user. Host A may be away from his machine, so his connection may be idle; host B may be downloading a lot of information at the time, so his connection may be fairly busy; and hosts C and D may have an average amount of bandwidth to spare. In this case, it would possibly be most efficient to download 50MB from host A, 5MB from host B, and the remaining 45MB from hosts C and D, divided according to which has more bandwidth available. It may also prove efficient to download 5MB directly from the international site, and then only split the remaining 40MB between hosts C and D. This feature can also be used when the system is presented with a ‘selfish user’: he may be allowed less bandwidth than he would ordinarily have been allocated from the hosts he is downloading from.
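One simple way to turn reported bandwidth into a download plan is to split the file in proportion to each host's spare capacity. The sketch below assumes that policy, which the proposal does not prescribe; the function name `split_download` and the bandwidth figures are made up so that the result matches the example above.

```python
def split_download(file_size_mb, spare_kbps):
    """Split a file across hosts in proportion to each host's spare bandwidth."""
    # Hosts reporting no spare bandwidth are left out of the plan.
    usable = {host: bw for host, bw in spare_kbps.items() if bw > 0}
    total = sum(usable.values())
    if total == 0:
        return {}
    # Whole megabytes per host, rounded down.
    shares = {host: file_size_mb * bw // total for host, bw in usable.items()}
    # Rounding down can leave a few megabytes unassigned; give them to the
    # host with the most spare bandwidth.
    fastest = max(usable, key=usable.get)
    shares[fastest] += file_size_mb - sum(shares.values())
    return shares


# Host A is idle, host B is busy, hosts C and D are in between.
plan = split_download(100, {"A": 500, "B": 50, "C": 250, "D": 200})
print(plan)  # {'A': 50, 'B': 5, 'C': 25, 'D': 20}
```

The international site could be treated as one more entry in `spare_kbps`, and a ‘selfish user’ could be handled by scaling down the bandwidth figures before the split; both are left out here to keep the sketch short.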
The system can possibly be designed with a few different configurations. While the most efficient would be for the hub to merely index files on various hosts, and allow other users to download files from multiple hosts, the possibility of the hub also serving as a host should not be discounted. In this scenario, users can download files and, before deleting them from their machines, upload them to the hub to ensure that a local copy of the file is still available. The system would still in a sense be distributed, in that the files the hub serves were originally retrieved by other users. Of course, care must be taken to prevent abuse of this system, as it would place extra strain on the hub’s bandwidth (local though it may be) and disk resources.
Another aspect to consider is that while the system will benefit all users by providing local copies of international files, it is not practical for all users to serve as hosts – someone on a 56kbps internet connection would benefit from slightly faster downloads, but if he were to also serve as a host, his already minimal bandwidth would be crippled by requests for files he has downloaded, even with load balancing in place to restrict the bandwidth that is used for uploading files. Because of this, the broadband users of the network are placed at a slight disadvantage in that they are both acting as hosts and users, while other users do not donate their bandwidth as hosts. To compensate, a subscription service can be implemented, where low-bandwidth users are given access to the system for a fee, and users who allow their machines to serve as hosts have free access to the system. Or, a fee may be charged per megabyte or gigabyte of traffic downloaded from other hosts in the system, so that all users have to pay for access, but users who act as hosts will also be credited for the bandwidth that they use to serve files. In this way, the ‘selfish users’ will not receive an unfair advantage over their counterparts, in that everyone pays the same amount, and since the ‘selfish user’ chooses not to act as a host, he will not be credited for files that he shares.
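The per-megabyte option could be tracked with a simple ledger. The sketch below is only one reading of that scheme: it assumes a host is credited at the same rate the downloader is charged, which the proposal leaves open, and the `Ledger` class, the rate and the user names are all hypothetical.

```python
from collections import defaultdict


class Ledger:
    """Tracks what each user owes (negative) or has earned (positive), in cents."""

    def __init__(self, cents_per_mb):
        self.cents_per_mb = cents_per_mb
        self.balance = defaultdict(int)

    def record_transfer(self, downloader, host, megabytes):
        # Assumption: the host is credited exactly what the downloader pays.
        cost = megabytes * self.cents_per_mb
        self.balance[downloader] -= cost
        self.balance[host] += cost


ledger = Ledger(cents_per_mb=2)
ledger.record_transfer("dialup-user", "adsl-user", 100)  # dial-up user downloads
ledger.record_transfer("adsl-user", "other-host", 40)    # ADSL user downloads too
print(dict(ledger.balance))
```

Under this reading, a user who never hosts only ever accumulates charges, while a user who both downloads and hosts sees the two partly cancel out.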

Comments
Well, this proposal was written almost a year and a half ago. Since then, a lot has changed with the state of ADSL and stuff, and after chatting with my supervisor last week, it seems that the target for these distributed systems shouldn't be home ADSL users any more (thanks for hardcapping an already crippled product, Telkom!). So it makes more sense to implement this type of system at a higher level: for example, a number of small (or large) ISPs could enter into an agreement to share their caches between them to reduce their international costs.
This doesn't affect the implementation too much, but it is an important factor to bear in mind when putting the system together (e.g. the system will have to be able to handle a much higher load if it is handling the caching for an entire ISP as opposed to a single user). Also, authentication mechanisms will probably have to work slightly differently.
Also, since the official topic will be changed sometime to highlight the distributed caching part of the system (since that is the real focus of the research anyway), I'm going to look at various other uses of distributed caching/file-storage/-retrieval, instead of purely web-caching. This means that the distributed caching system will have to be abstracted in such a way that it can plug into various types of applications, like web-caching, p2p file sharing, distributed backup systems, etc.
I'll re-work the proposal and edit the original post with a more accurate and current version of events ASAP.