A Desperate Plea for a Free Software Alternative to Aspera
April 12, 2019
RICH JONES
Too Long, Didn’t Read<br>tl;dr: I work at the Childhood Cancer Data Lab, where we use very big data to find cures for childhood cancers. To move data around the internet at very high speeds, we are forced to use a proprietary software suite called Aspera. If somebody could make a Free Software alternative, the future of the internet would be way more awesome! Best of all, you can be the one to do it!<br>A Big Data Transfer Protocol<br>For the past thirty years, the web’s HTTP protocol has been really great at serving things like web pages, chat messages, images, and other objects at the kilobyte to megabyte scale. Unfortunately, the web was not designed for the “big data” objects of today - genetic sequences, large databases, high definition media, and other files at the terabyte to petabyte scale. HTTP transfers over the open internet are hampered by constant connection re-establishment and TCP overhead. Some reconnection problems have been addressed in HTTP/2, but not the overall throughput issues.<br>The fact is, for very large files, HTTP is slow.<br>However, there is an alternative, proprietary protocol that avoids these problems: FASP, the Fast and Secure Protocol, more commonly known by the name of the client/server software product, Aspera. Aspera can deliver speeds up to 1000x faster than HTTP/FTP. I honestly didn’t believe it until I saw it for myself. In fact, I had never even heard of Aspera before I started this job, and there wasn’t even an English-language Wikipedia article for FASP yet - although I’ve since written one.<br>Out of necessity, many organizations are now using this proprietary software for large data transfers, including public institutions such as the National Institutes of Health, the European Genome Archive/Sequence Read Archive, the National Cancer Institute/cancer.gov, and FEMA, and private companies such as Amazon, Netflix, and IBM, which owns Aspera. 
Basically, if you’re working at terabyte scale, you probably need this software.<br>Demonstration<br>Let me reproduce an experiment for you. We’re going to download an experiment’s worth of RNA-Seq expression data from the Sequence Read Archive, which offers both FTP and Aspera downloads. On the left is FTP with wget, which starts first, and on the right is the Aspera client, ascp. This test was performed on two different EC2 instances in us-east-1, each fetching a 343 MB FASTQ file from the Sequence Read Archive, which is in the UK.
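For a sense of scale, here is a rough back-of-the-envelope sketch of the arithmetic behind this demonstration. It uses the figures reported for this run (a 343 MB file, 9 seconds for ascp, FTP over 200x slower). Everything else is an assumption made purely for illustration: the slowdown is treated as exactly 200x, the sample count is a hypothetical one million, every file is assumed to be the same size, and downloads are assumed to run one at a time.

```python
# Back-of-the-envelope transfer-time arithmetic for the demonstration.
# Figures taken from the article: file size, ascp time, "over 200x" slowdown.
# Assumptions (not from the article): exactly 200x, one million samples,
# identical file sizes, strictly serial downloads.

FILE_SIZE_MB = 343          # size of the FASTQ file in the demonstration
ASCP_SECONDS = 9            # reported ascp download time
FTP_SLOWDOWN = 200          # "over 200x", treated as exactly 200x (assumption)
SAMPLE_COUNT = 1_000_000    # hypothetical stand-in for "millions of samples"

SECONDS_PER_DAY = 86_400


def throughput_mb_per_s(size_mb: float, seconds: float) -> float:
    """Average throughput in megabytes per second."""
    return size_mb / seconds


def serial_days(seconds_per_file: float, count: int) -> float:
    """Days needed to fetch `count` files one after another."""
    return seconds_per_file * count / SECONDS_PER_DAY


# Implied FTP time for the same file under the 200x assumption.
ftp_seconds = ASCP_SECONDS * FTP_SLOWDOWN

print(f"ascp: {throughput_mb_per_s(FILE_SIZE_MB, ASCP_SECONDS):.1f} MB/s")
print(f"ftp:  {throughput_mb_per_s(FILE_SIZE_MB, ftp_seconds):.2f} MB/s")
print(f"ascp, {SAMPLE_COUNT:,} files: {serial_days(ASCP_SECONDS, SAMPLE_COUNT):.0f} days")
print(f"ftp,  {SAMPLE_COUNT:,} files: {serial_days(ftp_seconds, SAMPLE_COUNT) / 365:.0f} years")
```

Under these assumptions, ascp averages roughly 38 MB/s, and a million such files would take about 104 days fetched serially, versus roughly 57 years over FTP. Real pipelines download in parallel, so treat these as orders of magnitude rather than predictions.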
As you can see, the ascp client is over 200x faster than FTP, and downloads the whole file in 9 seconds. It’s pretty magical.<br>It’s also necessary for our project’s basic feasibility. Without Aspera, it would take us months to years to download the millions of samples like this that we need to process. We’d have to fly all over the world with suitcases full of hard drives to be able to collect our data!<br>A Proprietary Protocol, a Free Software Solution<br>In theory, the protocol is simple-ish. The client establishes an SSH connection to the server to negotiate and control the data flow. The data itself then travels in one direction over a range of UDP ports above 33001, one for each connection thread, while requests for retransmission are sent back over the SSH connection.<br>I say in theory, however, because the software is proprietary. So, I have no way of knowing what the protocol is actually doing during a file transfer, just as I have no way of knowing what the client software is doing to my laptop, or what the server software is doing to my servers.<br>Aspera is owned by IBM - does their client spy on what else is running on my system and report it back to their headquarters as business intelligence? If asked, would they share this information with the government? 
Without the source code, it’s impossible to tell!<br>Those were things people used to worry about in the very early days of the internet, until the invention of the Apache web server, which, along with GNU/Linux, truly democratized the web and allowed the explosion of growth from which we have all benefited.<br>I think a Free Software implementation of Aspera, or something similar, will cause a comparable revolution for big data.<br>Implications<br>There are obviously many immediate practical implications of having a Free FASP client/server/library implementation.<br>Science<br>Firstly, and most relevantly to our work at the Childhood Cancer Data Lab, biologists, physicists and other scientists who deal with “big data” won’t have to use proprietary software to retrieve and share the data they need in a timely fashion. That alone would be a massive boost for science, for software freedom and for data sharing everywhere. But the benefits don’t stop there!<br>Technology<br>There are also massive implications for system administrators...