ARTHUR KING PETERS TRAVEL GRANT PROJECT SUMMARY
"I will implement a new algorithm for a data analysis tool that I have developed this year. The first six weeks will be software development at the DataShape lab in Sophia Antipolis, and the remaining two weeks will be visits to two research communities who will be the first users of the software and heavily increase its dissemination: Université Nice Sophia Antipolis and INRIA Paris-Saclay. My current Fulbright project is developing new algorithms for Topological Data Analysis (TDA). TDA is a variety of topological methods for analyzing structure in massive datasets, and it has found major applications in many domains including medicine, finance, and astronomy. My host laboratory, INRIA DataShape, is geographically spread across two locations: Saclay (near Paris, in northern France) and Sophia Antipolis (near Nice, in southern France). This year at DataShape Saclay, I have been developing single-access algorithms for TDA, where single-access means that each datapoint may be accessed only once. This restriction reflects the reality of big data because typically a computer with relatively little local memory sifts through an enormous dataset where it is time-prohibitive to return to previously-viewed data. The restraint of only allowing a single access comes with the trade-off of requiring more local memory for the computation. However, as a great surprise, for a wide class of clustering functions I have discovered a technique that allows a single-access algorithm to exactly match the local memory requirement of multiple-access algorithms. A paper detailing the theoretical foundation of this technique is currently pending publication in the IEEE Foundations of Computer Science conference.
Having discussed my findings with Jean-Daniel Boissonnat, the research director of DataShape Sophia Antipolis, we plan to implement my algorithm in the C++ programming language and release it for free and public use. Whereas the Saclay branch (my current location) focuses on the theory of TDA, the Sophia Antipolis branch deals heavily with the practical applications of TDA and has a team of full-time software engineers. This is the ideal location to implement this algorithm because the team is already familiar with working on TDA and the director has agreed to assign two software engineers to work with me over a period of eight weeks.
Boissonnat and I have worked out the following structure for the two months. During the first week, I will introduce the engineers to my technique and explain the algorithm. In the second through fourth weeks, we will code a prototype version of the program. In the fifth and sixth weeks, we will optimize the code and add support for parallel architectures. In the remaining two weeks, the engineers will run benchmark tests on several classic datasets. During these two weeks, I will spend my time at Université Nice Sophia Antipolis and then INRIA Paris-Saclay to familiarize the data analysis communities with this new technique, promote the new software, and discuss directions for future work. I have selected these two locations because the researchers are involved with maintaining the widely-used software library “Gudhi”. Inclusion into the Gudhi library would rapidly disseminate our program to a wide community of users across the world.
The impacts of this research can span across many disciplines. In an ongoing project with astronomy professors at Johns Hopkins University, we are seeking to detect large-scale structure of the early universe using a dataset over fifty times larger than can be stored on a standard computer. Over the past 25 years, this dataset had only been analyzed by supercomputers. This year, by developing a single-access algorithm, we have been able to detect the same structures while using a personal laptop instead. This drastic example is by no means exceptional - the gains in efficiency of single-access algorithms can outweigh the computational difference between a laptop and the world’s most powerful computing machines.
The impact of this two month project will extend beyond the code that will be posted online for anyone - from academic researchers to industry engineers - to freely download and use. After the grant period, using the results of the tests completed by the engineers during the final two weeks, I will write an article that compares our program to existing work. This article will be submitted to the ACM Symposium on Applied Computing. As a later impact, since my method can be applied more generally, I believe that this software will serve as a proof-of-concept that will motivate other researchers to apply this technique to create improved single-access algorithms for problems outside of TDA."