Draft master's proposal

A problem facing internet users worldwide is bandwidth availability. Bandwidth costs money: the more one needs, the more it costs. Local bandwidth, which connects systems within the same country, is generally very cheap, so high-bandwidth local connections can be obtained at fairly low prices. International bandwidth, however, is very costly, because telecoms companies do not have exclusive control over international links and have less freedom to upgrade them to provide more bandwidth. As a result, dedicated lines with guaranteed international bandwidth are extremely expensive. Even lines with non-guaranteed international bandwidth (e.g. Telkom’s ADSL), while not as expensive, have restrictions: ADSL subscriber traffic is monitored, and subscribers are limited to 3 GB of international traffic per month. Once a subscriber reaches this limit, they are moved to a slower international line that is shared with all other subscribers who have reached their limit, which makes international access effectively unusable.

What should be fairly obvious is that a large amount of international traffic is duplicated: thousands of South African internet users would probably download the same video card drivers, for example, from the manufacturer’s international website. If the driver download is 10 MB and 1000 users download it, then 10 GB of international bandwidth has been used for one 10 MB file. The figures are even larger when Microsoft releases service packs, which are typically larger than 100 MB. Since most corporate networks and private users update their systems with the latest service packs, there are tens of thousands of downloads of a file that is 100-200 MB in size. The amount of duplicated traffic is phenomenal.
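The arithmetic behind the driver example can be written out as a rough sketch. It assumes that sizes are in megabytes, that the file is fetched from the international site exactly once, and that every later request is served from a local copy; the function name is purely illustrative.

```python
def duplicated_traffic_mb(file_size_mb, downloads):
    """International traffic (in MB) that sharing one local copy would avoid.

    Assumes a single international fetch, with all later requests served locally.
    """
    if downloads < 1:
        return 0
    total_without_sharing = file_size_mb * downloads
    # Everything except the first download is duplicated traffic.
    return total_without_sharing - file_size_mb


# Driver example from the text: a 10 MB file downloaded by 1000 users.
print("total without sharing (MB):", 10 * 1000)
print("avoidable with sharing (MB):", duplicated_traffic_mb(10, 1000))
```

Under these assumptions, almost all of the traffic in the example is avoidable, which is the whole case for the system.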

What I propose to research and implement is a “distributed proxy network” to conserve international bandwidth. A service will be installed on every machine that is to be connected to the proxy network, and each service will connect to the network via a central proxy server, or hub. Users of the network will direct their browsers to use the service as a gateway to the internet.

When a user running the service wants to download a file from an international site, his browser first checks with the service to see whether that exact file has already been downloaded. To do this, the service retrieves the file’s headers from the international site and asks the hub whether a matching file has already been downloaded by a user on the network.

  • If there is no match, the service downloads the file from the international site and stores it on the user’s local machine. It also transmits the headers for that file to the hub, along with the name of the machine that downloaded the file.
  • If a match exists, the hub provides the service with a list of all machines on the network that already have a copy of that file. The service then downloads the file from one or more of the machines on the list, and the user’s machine is added to the list of hosts for that file.

    It is important to note that this system is not a peer-to-peer network: users will not be able to browse other users’ machines and retrieve files directly. Files will only be accessible via a redirected download request for a file from an international site.
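The lookup-and-register flow described above could be sketched roughly as follows. This is only an illustration, not a design decision: the `Hub` class, the `file_key` and `handle_request` functions, and the choice of `Content-Length`, `Last-Modified` and `ETag` as the headers that identify "that exact file" are all assumptions of mine, and the actual file transfer is left out.

```python
import hashlib
from collections import defaultdict


def file_key(url, headers):
    """Build a lookup key from the URL and the headers that identify a version.

    Assumption: Content-Length, Last-Modified and ETag together are enough
    to recognise "that exact file"; the proposal does not fix this choice.
    """
    fields = [url] + [
        f"{name}={headers.get(name, '')}"
        for name in ("Content-Length", "Last-Modified", "ETag")
    ]
    return hashlib.sha256("\n".join(fields).encode("utf-8")).hexdigest()


class Hub:
    """Hypothetical in-memory index mapping a file key to its hosts."""

    def __init__(self):
        self._hosts = defaultdict(set)

    def lookup(self, key):
        """Return the machines known to hold the file (empty if no match)."""
        return sorted(self._hosts.get(key, ()))

    def register(self, key, machine):
        """Add a machine to the list of hosts for the file."""
        self._hosts[key].add(machine)


def handle_request(hub, machine, url, headers):
    """Decide where a download should come from, then register the new copy."""
    key = file_key(url, headers)
    hosts = hub.lookup(key)
    source = "local hosts" if hosts else "international site"
    # The actual transfer is omitted from this sketch.
    hub.register(key, machine)
    return source, hosts


hub = Hub()
headers = {"Content-Length": "10485760", "ETag": '"abc123"'}
url = "http://www.example.com/driver.exe"
print(handle_request(hub, "machine-a", url, headers))  # first request: no match
print(handle_request(hub, "machine-b", url, headers))  # second request: match
```

Keying on the headers as well as the URL means that a file which changes on the international site produces a new key, so stale local copies are simply never matched.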

    Of course, this method presents a few problems; possible solutions are presented afterwards.

  • Privacy – Suppose a user downloads a pornographic video clip from an international site. He may not want other users on the network who request the same clip to know that he has downloaded it. Since the hub returns a list of the machine names that are hosting a particular file, it is easy to tell which users on the network have the clip. Privacy advocates may also argue that the hub is tracking a user’s online behaviour by maintaining a list of the files he has downloaded.
  • Hard drive space – The service will download all files to a ‘shared’ directory on the user’s machine. Although it is not an actual network share, this is the directory in which all hosted files will be stored, and it would typically grow to a considerable size, depending on the amount of data downloaded.
  • User selfishness – A user may download files from the network but immediately remove them from the ‘shared’ directory, making them unavailable to the rest of the network. Since other users would use some of his local bandwidth to retrieve the files from him, he removes them to conserve his bandwidth.
  • Web pages that are custom-generated based on session variables or cookies – Many sites use cookies or session variables to maintain login information, and generate pages on the fly that are specific to the logged-in user. So if users keep a cookie to log in automatically to www.xyz.com/home.php, the page retrieved for www.xyz.com/home.php for user A may be quite different from the page retrieved for user B, although both have the same filename, and user B should not have access to user A’s page.
  • Legal/stats problems – Some websites may not want their content to be mirrored, as this could affect their stats or raise copyright or other legal issues.

    Possible solutions to the above problems include:

  • Privacy – Although the proxy system is transparent to the user (there would not be a window saying ‘retrieving file x from user A and user B’), a direct connection to users A and B must be established to retrieve a file from them. A user with a firewall or similar program that monitors connections can therefore see the IP addresses of the machines he connected to, and infer whose machines they are. One solution would be not to share files from the user’s machine at all, but to upload them to the hub and have subsequent users download from there. However, this detracts from the ‘distributed’ aspect of the project (see below), and it would place a strain on the hub, whose sole function should be to manage file locations on connected machines. The moment the facility to upload files to the hub exists, many ‘selfish users’ (see above) will abuse it by uploading all their shared files to the hub and sharing nothing from their own machines, on the theory that they contributed to the system by downloading the file initially. An alternative could be to provide two options when downloading a file: download to the shared directory, or download to a private directory. A user downloading a sensitive file can choose the private directory, and the file will not be shared.
  • Hard drive space – The user will be allowed to specify how much hard drive space the system should have access to. If a user downloads more than this, then the system will automatically begin freeing up space by deleting the oldest files until sufficient space is available.
  • User selfishness – The system can detect a user who habitually removes shared files soon after downloading them: the user is added to the list of hosts for a file, but when another user attempts to download the file from him, it is not available. In that case the hub is notified that the file is no longer available from that host, and the host is removed from the list. A user who does this habitually may be blocked from the network, or his priority in download queues may be lowered.
  • Custom pages – The system can be configured not to store certain file types (php, asp, cgi, etc.), and not to store files smaller than a certain size. For a file of only a few bytes or kilobytes, downloading directly from the source is quite possibly more efficient than retrieving a list of hosts from the hub and then making a second connection to another machine to fetch the file. As further custom-generated file types are identified, they can be added to the list of ‘do not store’ files.
  • Legal/stats problems – Each download makes a request to the international server to check whether the target file has changed since it was locally mirrored. The international server may register this request as a download attempt, in which case its download stats may still be correct. If it does not, a workaround will have to be found. With regard to legal and copyright issues, existing proxy systems will be examined to find out how this problem is currently addressed.
  • Load balancing – The system will also perform some form of intelligent load balancing. When the hub generates its list of hosts, it will query each host to find out how much bandwidth that host has available, and will then instruct the user’s service how much to download from each. For example, if four users are hosting a 100 MB file, simply downloading 25 MB from each may not be feasible: host A may be away from his machine, so his connection is idle; host B may be downloading a lot of data, so his connection is fairly busy; and hosts C and D may each have an average amount of bandwidth to spare. In this case it might be most efficient to download 50 MB from host A, 5 MB from host B, and the remaining 45 MB from hosts C and D, split according to which has more bandwidth available. It may even prove efficient to download 5 MB directly from the international site, and split only 40 MB between hosts C and D. The same mechanism can be used when the system is presented with a ‘selfish user’: he may be allowed less bandwidth from the hosts he is downloading from than he would ordinarily have been allocated.
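One simple way to turn the reported bandwidth figures into download shares is a proportional split, sketched below. This is an assumption on my part rather than a settled algorithm: the function name is hypothetical, the bandwidth figures are taken to be whole numbers in any consistent unit, and shares are rounded to whole megabytes.

```python
def split_download(file_size_mb, spare_bandwidth):
    """Split a file across hosts in proportion to their spare bandwidth.

    spare_bandwidth maps host name -> spare capacity (integers, any one unit).
    Returns host name -> whole megabytes to fetch; shares sum to the file size.
    """
    total = sum(spare_bandwidth.values())
    if total <= 0:
        raise ValueError("no host has spare bandwidth")
    shares = {
        host: file_size_mb * bandwidth // total
        for host, bandwidth in spare_bandwidth.items()
    }
    # Integer division can leave a few megabytes unassigned; hand them out,
    # one each, to the hosts with the most spare bandwidth.
    leftover = file_size_mb - sum(shares.values())
    by_bandwidth = sorted(spare_bandwidth, key=spare_bandwidth.get, reverse=True)
    for host in by_bandwidth[:leftover]:
        shares[host] += 1
    return shares


# Idle host A, busy host B, average hosts C and D (figures are made up).
print(split_download(100, {"A": 10, "B": 1, "C": 5, "D": 4}))
```

With these made-up bandwidth figures, the split reproduces the 50/5/45 example from the text. Penalising a ‘selfish user’ could then be as simple as scaling down the figures the hub uses for his requests.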

    The system could be designed in a few different configurations. The most efficient would be for the hub merely to index files on the various hosts and let users download from multiple hosts, but the possibility of the hub also serving as a file host should not be discounted. In this scenario, users can download files and, before deleting them from their machines, upload them to the hub to ensure that a local copy is still available. The system would still be distributed in a sense, in that the files the hub serves were originally retrieved from other users. Of course, care must be taken to prevent abuse of this arrangement, as it would place extra strain on the hub’s bandwidth (local though it may be) and disk resources.

    Another aspect to consider is that while the system will benefit all users by providing local copies of international files, it is not practical for all users to serve as hosts. Someone on a 56 kbps internet connection would benefit from slightly faster downloads, but if he were also to serve as a host, his already minimal bandwidth would be crippled by requests for files he has downloaded, even with load balancing in place to restrict the bandwidth used for uploads. Broadband users are therefore placed at a slight disadvantage: they act as both hosts and users, while other users do not donate any bandwidth as hosts. To compensate, a subscription service could be implemented, in which low-bandwidth users pay a fee for access and users who allow their machines to serve as hosts have free access. Alternatively, a fee could be charged per megabyte or gigabyte of traffic downloaded from other hosts in the system, so that all users pay for access, but users who act as hosts are credited for the bandwidth they use to serve files. In this way ‘selfish users’ receive no unfair advantage over their counterparts: everyone pays at the same rate, and since the ‘selfish user’ chooses not to act as a host, he earns no credit for serving files.
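The per-megabyte option could be tracked with a very small ledger, sketched below. The `Ledger` class, the fee of 2 cents per megabyte and the user names are all hypothetical, and the sketch assumes that a host is credited exactly what the downloader is charged, which the proposal does not specify.

```python
from collections import defaultdict


class Ledger:
    """Hypothetical per-megabyte accounting of transfers between users."""

    def __init__(self, fee_cents_per_mb):
        self.fee_cents_per_mb = fee_cents_per_mb
        self.balance_cents = defaultdict(int)  # user -> running balance

    def record_transfer(self, downloader, host, megabytes):
        """Charge the downloader and credit the host for one transfer.

        Assumption: the host is credited the full amount the downloader pays.
        """
        amount = self.fee_cents_per_mb * megabytes
        self.balance_cents[downloader] -= amount
        self.balance_cents[host] += amount


ledger = Ledger(fee_cents_per_mb=2)
ledger.record_transfer("dialup-user", "broadband-user", 100)  # pays, never hosts
ledger.record_transfer("broadband-user", "other-host", 60)    # pays and hosts
print(dict(ledger.balance_cents))
```

A user who never hosts only ever accumulates charges, while a user who hosts can offset some or all of what he pays.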

    Comments

    dhiren Thu, 01/01/1970 - 02:00

    Well, this proposal was written almost a year and a half ago. Since then, a lot has changed with the state of ADSL and stuff, and after chatting with my supervisor last week, it seems that the target for these distributed systems shouldn't be home ADSL users any more (thanks for hardcapping an already crippled product, Telkom!). So it makes more sense to implement this type of system at a higher level - for example, a number of small (or large) ISPs could enter into an agreement to share their caches between them to reduce their international costs.

    This doesn't affect the implementation too much, but it is an important factor to bear in mind when putting the system together (e.g. the system will have to be able to handle a much higher load if it is handling the caching for an entire ISP as opposed to a single user). Also, authentication mechanisms will probably have to work slightly differently.

    Also, since the official topic will be changed sometime to highlight the distributed caching part of the system (since that is the real focus of the research anyways), I'm going to look at various other uses of distributed caching/file-storage/-retrieval, instead of purely web-caching. This means that the distributed caching system will have to be abstracted in such a way that it can plug into various types of applications, like web-caching, p2p file sharing, distributed backup systems, etc.

    I'll re-work the proposal and edit the original post with a more accurate and current version of events asap
