A Data Scientist’s “Sabbatical”: Skill-building through external scientific collaboration

(with the Frontier Development Lab, SETI Institute, Trillium USA, NASA, and Google Cloud)

Why go outside the company to build new skills

New problems, new solutions, fresh perspective

Insight Edge has a unique mission compared to other DX-oriented startups: we target digital transformation across the overall Sumitomo Corporation (SC) group. That may sound like a limited purview -- after all, other DX- and AI-themed startups may serve an entire market of industrial clients, as long as the project is mutually agreeable.

While perhaps counterintuitive, our focused mission is actually a source of a wide variety of projects. Our collaborative relationship with our parent SC's DXC means that our projects can take a longer view, rather than fretting over strict quarterly profit quotas. We can incur short-term costs, as long as the longer-term mission and output produce growth for SC and our global operating companies. You can think of us as building DX momentum for the group, rather than trying to maximize profit per project.

With all of that being said, there are times when we, as data scientists and engineers at Insight Edge, could benefit from exploring problems outside the group. If all we ever look at are the obvious issues with which our partners are struggling, we could miss chances to be more proactive, inventive, and creative.

Part of how we have accomplished this outside grounding is by having a diverse talent pool from the start. IE's technical staff includes those with doctoral degrees in chemistry, planetary science, neuroscience, and even astronomy (that's me).

This gives our team a diverse and scientifically savvy background, but what's even better is the opportunity for us to reconnect with science-themed projects -- or, ideally, novel challenges requiring a combination of professionals spanning ML, cloud engineering, and the physical sciences.

In addition, our members often engage in external events like exhibitions, conferences, and workshops to encounter new issues which, at least on the surface, differ from what the SC group is currently struggling with (see the SC Group home page for an overview of our global activities).

What is the Frontier Development Lab

The Frontier Development Lab (FDL) is a public-private partnership whose planning runs year-round. The actual R&D, however, takes place during an intense, adventurous (and often challenging) two-month agile development sprint.

From Space to Sustainability

FDL started in 2016, originally focusing on challenges combining space science domain knowledge with practical expertise in data science and machine learning. Project themes have since branched out from space exploration into sustainability-related areas, such as Earth observation, disaster prevention and management, and renewable energy.

Public-private partnership

FDL's core government-side partner has been NASA, with the SETI Institute acting as a liaison between NASA and private partners such as NVIDIA, Google Cloud, and others. In 2022, the US Department of Energy also began sponsoring FDL challenge teams, broadening the scope beyond projects with a clear space connection.

Agile Development in Novel Situations

The agile development aspect of FDL is what has always really inspired me. Here you have a way of working that meshes well with what industry-employed developers and engineers face every day: fast-paced, incremental agile development and problem-solving cycles. At the same time, this skill is much needed in academia, where projects can tend to stagnate or become siloed into the labs of a few key experts.

Astrobiology tools, and problems relevant to industrial data science

From Frontier Science Toward Frontier Industries

FDL has sometimes hosted challenges over the years that seem rather avant-garde. Probably at the top of the list would be the "Astrobiology" team, which I was part of in 2018 and again in 2022. You might think: "Wait, that has absolutely nothing to do with industry!" But as a discipline, astrobiology connects multiple domains and fundamental questions: understanding the evolution of even the most rudimentary systems of life and their theoretical building blocks, the future of our own planet, and the search for signs of present or past life in our Solar System.

The Need for Interdisciplinarity

Personally speaking, working through the challenges of the Astrobiology project at FDL is analogous to the challenges I face in my industry job. At Insight Edge, we are constantly taking on new DX challenges from SC subsidiaries and business units in all sorts of domains, from manufacturing to energy grids -- which has us always diving into new projects, learning how each domain handles its problems, how we can help, and how we can connect them to what we've learned from other domains in completely different projects. This means Insight Edge engineers have to be somewhat interdisciplinary by nature.

Astrobiology is likewise one of the most interdisciplinary research topics one can address. Even if you build a team, like ours at FDL, with two "astrobiologists", they will most likely have two very different perspectives and skillsets, e.g. geophysics and bioinformatics. FDL is as much about solving the problem we're given as it is about redefining that problem based on our team's talents.

Projects at Insight Edge are similar: given sometimes limited resources, we adapt what we propose and how to accomplish it based on the talent we have available and the actual situation on the ground. We have to build relationships with the folks actually working in and managing factories, for example. Likewise at FDL, we have to reach out to many different stakeholders, scientific collaborators, and data repositories in a very short timeframe, so that what we create is well-tuned to actual scientific needs: we don't pretend to know everything about our challenge within our small team.

Space Science and AI

In recent years "AI" has become a kind of catch-all term for so many things. Sometimes the ways that FDL connects Space Science and AI can feel like Science Fiction. But we have more concrete reasons for looking to AI -- more specifically, autonomous spacecraft and robotics, and automated on-site sampling and data analysis techniques.

Difficulty of pre-training models for unexpected environments

Even the latest language models are difficult to train for all the different situations and contexts humans may throw at them. Now imagine you're training a model to explore new worlds: there is a huge risk of "overfitting" such models to the environments we know on Earth.

Rather than simply pre-training a classifier on known living organisms and biomolecules, we needed to find a way to remain open to as-yet-unknown biomolecular structures while still having something that could perform inference rapidly and efficiently.

Resource Limitations on Space Missions

What most AI news these days is talking about is usually deep learning: very large models with millions, even billions, of parameters that can map inputs to outputs. You might also have heard how incredibly expensive these models are to train. Even so, actually using a trained model is often feasible enough that we can deploy them as web apps on cloud services and GPU servers, such that tech-savvy users are bound to run across these tools on a daily basis (your iPhone auto-categorizing your photos, a web page for anime-izing your homework).

For reference, consider the computers onboard existing missions. The Mars Perseverance rover is equipped with only a single-core CPU operating at only a few hundred MHz -- very conservative compared to even the standard environment of a Google Colab, or an NVIDIA GTX 10-series budget gaming PC. Space hardware is heavily customized for specific environments, power constraints, and the need for redundancy in missions where hardware failure can mean the permanent loss of a multi-million-dollar mission.

While it's true that future missions will likely have more computing power than Perseverance -- whose radiation-hardened processor traces back to a late-1990s CPU design -- it is unlikely they will keep pace with the environments that deep-learning developers take for granted. Any computational techniques developed now for future missions need to take such constraints into account.

Communication delays

Space is big, and mission time is often limited -- especially when you consider specific windows of opportunity, like the time between dust storms on Mars. While many computational and simulation tools exist for assessing samples and deciding on the next steps of an experiment, two major limitations exist: 1) many (though not all) of these tools can be very computationally expensive, and the hardware onboard space probes is often very conservative (and sometimes several years old by the time the mission starts!); 2) mid-mission assessment, experiment planning, and re-planning typically require human intervention -- meaning a round-trip transmission lag to and from your favorite planet or moon.

Autonomous, or at least semi-autonomous, experiment guiding could save precious sampling time. Also, computationally expensive calculations and simulations could be "compressed" or emulated by a well-trained neural network, which could give us a balanced trade-off between accuracy, computational time, and hardware requirements.
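To make that last idea a bit more concrete, here is a minimal sketch of the surrogate-model pattern on entirely made-up toy data (nothing here comes from our FDL pipeline): fit a small neural network to reproduce the output of an "expensive" calculation, so that later predictions become nearly free.

```python
# Illustrative sketch only: emulate an expensive scoring function with a small
# neural network so that inference becomes cheap. "expensive_score" is a toy
# stand-in, not any real simulation from the FDL project.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def expensive_score(x):
    # Placeholder for a slow calculation (imagine hours of CPU time per sample).
    return np.sin(3 * x[:, 0]) + x[:, 1] ** 2

X = rng.uniform(-1, 1, size=(5000, 2))
y = expensive_score(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Small MLP "emulator": trades a bit of accuracy for near-instant predictions.
emulator = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
emulator.fit(X_train, y_train)
print("R^2 on held-out samples:", emulator.score(X_test, y_test))
```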

Life Creates Complexity: How do we search for it

Agnostic detection approaches

If we could assume life elsewhere in the universe resembled Earth life, tests for specific molecules and reactions common to all known organisms might apply. But how can we make such an assumption, knowing the vast diversity of environments out there? Our FDL project instead assumes that life could take on molecular forms we may not recognize, and that life, in whatever form, is likely to assemble simple building blocks into more complex ones.

A more concrete way of looking at this idea is to think in terms of molecular complexity: how likely is it that a sample of molecules detected on some celestial body formed without the involvement of life? Geological and atmospheric processes can certainly produce complex molecules, but statistically speaking, high abundances of increasingly complex molecules may be one of the most agnostic indications of living processes at work.

For the technical discussion, I'll rely mostly on quotes from our workshop abstract, presented at the ML for Physical Sciences workshop at the 2022 NeurIPS conference in New Orleans. By all means, please have a read, and look into the references therein! This was a very collaborative work.

This paper was led by our team's ML lead, Timothy (Timmy) Gebhard, and describes the work we carried out over 8 weeks in summer 2022. Also on the core researcher team were Jaden ("J.J.") Hastings, who grounded our bioinformatics and space exploration goals, and Jian Gong, our primary geochemistry/geobiology domain expert, who also took the lead on our data strategy.

[Photo: FDL 2023 Astrobiology researcher team members standing in front of the famous Drake Equation sign at the SETI Institute office in Mountain View, California. From left to right: Aaron Bell, Timmy Gebhard, JJ Hastings, and Jian Gong.]

We go into the details of three commonly cited metrics for defining molecular complexity. I'll quote the descriptions below, but the main point is that this is not an easy question to answer, so we opted to explore multiple metrics instead of favoring a particular one -- another approach that tends to appear in industrial data science projects whenever we tackle a novel industry or task.

Fundamentally, [molecular complexity, "MC"] measures are numeric features intrinsic to a molecule that represent an abstraction of its structure (or formation process) while also characterizing its information content. Intuitively, one may expect MC to increase with the molecular size, the multiplicity of bonds, or the presence of heteroatoms, while it should decrease with increasing symmetry (Randic, 2005). MC is usually not considered as an end in itself but used in a relative fashion to compare molecules or to characterize chemical reactions. Various definitions of MC have been proposed in the literature (see, e.g., the introduction of (Boettcher, 2016) for an overview), typically building on concepts from graph and information theory. In this work, we focus on the following three definitions:

1) Bertz complexity $C_T$ (Bertz, 1981): The first general index of molecular complexity. It combines concepts from graph and information theory and is defined as $C_T = C(\eta) + C(E)$, where $C(\eta)$ describes the bond structure and $C(E)$ the complexity due to heteroatoms. Calculating $C_T$ is fast and scales linearly with the molecule size. We compute $C_T$ using the BertzCT method from RDKit (The RDKit Team, Landrum et al.).

2) Böttcher complexity $C_m$ (Boettcher, 2016, 2017): An information-theoretic measure that is based on the information content in the microenvironments of all atoms; it is additive and simple to calculate even for large molecules. Our computation of $C_m$ uses a freely available open-source implementation (Boskovic et al., 2020).

3) Molecular Assembly index (MA) (Marshall et al., 2017): Also known as pathway complexity, the MA represents the minimum number of steps required to assemble a molecule from fundamental building blocks. MA is particularly well-suited to biosignature detection while being experimentally verifiable (Marshall et al., 2021). It is at least as hard as NP-complete to compute (Liu et al., 2021), requiring hundreds of CPU hours even for moderately sized molecules. We use a (currently non-public) implementation kindly provided by the authors of Marshall et al. (2017).
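Of the three, the Bertz index is the easiest one to try yourself. Below is a minimal sketch using RDKit's built-in BertzCT descriptor; the example molecules are just familiar ones I picked for illustration, not entries from our dataset.

```python
# Minimal sketch: Bertz complexity (C_T) for a few familiar molecules via RDKit.
# The SMILES strings below are illustrative examples, not drawn from our dataset.
from rdkit import Chem
from rdkit.Chem import GraphDescriptors

molecules = {
    "water": "O",
    "ethanol": "CCO",
    "caffeine": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
}

for name, smiles in molecules.items():
    mol = Chem.MolFromSmiles(smiles)
    # BertzCT is fast and grows with molecular size, bond multiplicity, etc.
    print(f"{name:10s} BertzCT = {GraphDescriptors.BertzCT(mol):8.1f}")
```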

While we chose to look at multiple metrics, we also found that correlation often exists between them. I mention this primarily because of how often this occurs in industrial data science too! Often in our projects, we receive data sources that, at least at first glance, appear to be highly correlated. Only when we explore higher-dimensional features can we see which data is more informative.

We note that, to first order and at low molecular weights, these three measures are strongly correlated; the mass of the molecule acts as a confounder constraining the maximum complexity. When regressing out the mass, however, the correlation becomes less strong, and it becomes apparent that $C_T$, $C_m$, and MA each capture different aspects of the molecule.
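To illustrate what "regressing out the mass" means in practice, here is a small sketch on synthetic stand-in values (not our actual complexity data): two scores that both grow with molecular mass look highly correlated until the linear dependence on mass is removed from each.

```python
# Illustrative sketch of "regressing out" molecular mass before correlating
# complexity measures. The arrays are random stand-ins; in practice they would
# hold C_T, C_m, MA, and molecular weights for real molecules.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mass = rng.uniform(50, 1000, size=500)
# Toy complexity scores that both grow with mass (hence the raw correlation).
score_a = 2.0 * mass + rng.normal(0, 50, size=500)
score_b = 0.05 * mass + rng.normal(0, 5, size=500)

def residualize(y, x):
    # Remove the linear dependence of y on x and return the residuals.
    slope, intercept, *_ = stats.linregress(x, y)
    return y - (slope * x + intercept)

raw_r = stats.pearsonr(score_a, score_b)[0]
resid_r = stats.pearsonr(residualize(score_a, mass), residualize(score_b, mass))[0]
print(f"correlation before / after removing mass: {raw_r:.2f} / {resid_r:.2f}")
```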

Inferring complexity with Machine Learning

How to measure molecular complexity is not a settled matter (pun intended). In our paper, we detail various approaches to training models on molecular complexity data, in hopes of eventually achieving a model that can -- at perhaps some cost in accuracy -- very quickly infer scores that would otherwise take a very long time to calculate.

Figure 1 illustrates where this capability might fit into a broader, automated analysis pipeline of a rover or probe exploring another celestial body.

Dataset generation

While the above talk of molecules and complexities and life and space might already sound overwhelming, the truth of this project -- not unlike almost every project I have worked on in industry -- is that dataset curation was the main obstacle. As often occurs in data science, potentially relevant but fragmented or poorly organized datasets are all around. Actually assembling the data you need, in a format that is both ready for ML and easy to describe to stakeholders, is most of the battle.

The following text describes some of the technical particulars, but the main hurdle was acquiring mass-spectrometry data for all the molecules we wanted to explore. This is the practical crux of our challenge since mass-spectrometry is very likely the type of data that a future probe would actually sample.

As no ready-made dataset for our task---inferring MC from MS data---exists, we created our own. For this, we queried a public database (the NIST WebBook), retrieving all molecules below 1000 Dalton for which an MS was available. These molecules were then appended with other basic chemical properties, including our three MC metrics. Empirical MS data on NIST were taken using electron ionization at a $m/z$-resolution of 1 (standard MS as opposed to higher resolution tandem MS that employs a variety of techniques). This is approximately comparable to the target resolution of 0.4-3 of the DraMS instrument onboard the Dragonfly mission (Grubisic, 2021). Our final dataset consists of 17,021 unique molecules with associated MS, randomly split into a training set with 12,000 molecules and an evaluation set with 5,021. [...] The major limitation of the dataset generation was the computation of MA, taking 65,000+ CPU hours over hundreds of compute-optimized nodes on Google Cloud in parallel.
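To give a feel for what an ML-ready format might look like here, the sketch below (with a made-up peak list, not NIST data) bins a mass spectrum into a fixed-length vector at unit m/z resolution and normalizes it to the base peak -- one plausible way to turn spectra into feature rows.

```python
# Toy sketch: turn a peak list of (m/z, intensity) pairs into a fixed-length
# vector at unit m/z resolution, so each spectrum becomes one ML-ready row.
# The peak list below is invented for illustration; it is not NIST data.
import numpy as np

MAX_MZ = 1000  # molecules below 1000 Dalton

def bin_spectrum(peaks, max_mz=MAX_MZ):
    """Sum intensities into integer m/z bins and normalize to the base peak."""
    vec = np.zeros(max_mz)
    for mz, intensity in peaks:
        idx = int(round(mz))
        if 0 <= idx < max_mz:
            vec[idx] += intensity
    if vec.max() > 0:
        vec /= vec.max()  # base-peak normalization
    return vec

example_peaks = [(18.0, 120.0), (28.1, 999.0), (43.9, 310.0)]
x = bin_spectrum(example_peaks)
print(x.shape, x[[18, 28, 44]])
```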

Prediction results

Apologies in advance for the spoiler: we did not detect aliens by the end of the summer. We did achieve a model that shows promising performance at predicting various molecular complexity scores! The following quote gives some technical details, described visually in the figure thereafter.

Unsurprisingly, all models outperform the naïve baseline (reducing the error by more than 50% in the best case), and non-linear models perform better than the linear one. More interestingly, we find that there is a consistent trend across all models that MA is easier to predict than Böttcher complexity, which in turn is easier to predict than Bertz complexity (evidenced by respectively lower prediction errors). We speculate that this may have to do with the definition of the MA, which is, in a way, conceptually similar to the idea of mass spectrometry: The MA counts the number of steps to assemble a molecule from smaller pieces, while MS observes the patterns that emerge when a molecule is fragmented.
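The comparison pattern itself is easy to reproduce. Here is a sketch on synthetic data (not our MS dataset) that pits a naive predict-the-mean baseline against a linear model and a non-linear model -- the same three-way comparison described in the quote above.

```python
# Sketch of the baseline-vs-linear-vs-non-linear comparison on synthetic data.
# X stands in for binned-spectrum features; y stands in for a complexity score.
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "naive mean": DummyRegressor(strategy="mean"),
    "linear": LinearRegression(),
    "non-linear": GradientBoostingRegressor(random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name:11s} MAE = {mean_absolute_error(y_te, model.predict(X_te)):.3f}")
```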

Of course, what really matters is how this is put into use, which highlights another issue in industry: a lot of projects end as PoCs. Preliminary results, even promising ones, may not translate into actual performance. This can be for a variety of reasons: the PoC dataset was not representative of production, the stakeholders have limited appetite for production implementation, or the PoC itself was never connected to a real commitment to production (i.e. its main goal was to convince a manager or administrator to pay attention, or to allocate resources toward long-term goals). In our case, the project is designed to demonstrate that scientific and space exploration bottlenecks may be resolved with agile science-tech development and machine learning approaches. The actual implementation depends on many different factors, but we are extremely hopeful that this work has demonstrated the problem-solving potential of machine learning toward finding new neighbors in our Solar System.

FDL in 2023

FDL is an ongoing program, with the particular topics addressed varying each year. For the latest news, please see the FDL 2023 homepage! You'll find the full summary of results for all FDL 2022 teams here. As for me, I'll continue to actively hunt for life beyond Earth (with my Insight Edge vacation days), continue to look for ways to contribute my industry knowledge beyond Insight Edge -- and also for ways that extremely distant (pun intended, as usual) topics and challenges can inspire my work within Insight Edge and the Sumitomo Corporation Group.

References

References appearing in the quotes above are listed here, but you can find the full reference list for our project in the workshop paper linked above. Here it is again for convenience!

  • M. Randić, X. Guo, Plavšić, and A. T. Balaban, “On the Complexity of Fullerenes and Nanotubes,” in Complexity in Chemistry, Biology, and Ecology, New York, NY, USA: Springer, 2005, pages 1–48. DOI: 10.1007/0-387-25871-x_1.
  • T. Böttcher, “An Additive Definition of Molecular Complexity,” Journal of Chemical Information and Modeling, volume 56(3): 462–470, 2016. DOI: 10.1021/acs.jcim.5b00723.
  • S. H. Bertz, “The first general index of molecular complexity,” Journal of the American Chemical Society, volume 103(12): 3599–3601, 1981. DOI: 10.1021/ja00402a071.
  • The RDKit Team (G. Landrum et al.), RDKit: Open-source cheminformatics. Online
  • Boskovic Research Group, bottchercomplexity, 2020. Online, Commit: a212f96.
  • S. M. Marshall, C. Mathis, E. Carrick, G. Keenan, G. J. Cooper, H. Graham, M. Craven, P. S. Gromski, D. G. Moore, S. Walker, and L. Cronin, “Identifying molecules as biosignatures with assembly theory and mass spectrometry,” Nature Communications, volume 12(1), 2021. DOI: 10.1038/s41467-021-23258-x.
  • S. M. Marshall, A. R. G. Murray, and L. Cronin, “A probabilistic framework for identifying biosignatures using Pathway Complexity,” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, volume 375(2109): 20160342, 2017. DOI: 10.1098/rsta.2016.0342.
  • Y. Liu, C. Mathis, M. D. Bajczyk, S. M. Marshall, L. Wilbraham, et al., “Exploring and mapping chemical space with molecular assembly trees,” Science Advances, volume 7(39), 2021. DOI: 10.1126/sciadv.abj2465.
  • A. Grubisic, M. G. Trainer, X. Li, W. B. Brinckerhoff, F. H. van Amerom, et al., “Laser Desorption Mass Spectrometry at Saturn’s moon Titan,” International Journal of Mass Spectrometry, volume 470: 116707, 2021. DOI: 10.1016/j.ijms.2021.116707.

Acknowledgements

This work would not have been possible without the incredible support from our team's mentors: Atılım Günes Baydin, Kimberley Warren-Rhodes, Michael Phillips, G. Matthew Fricke, Nathalie Cabrol, and Scott Sandford; funding and support from Google Cloud; and our primary scientific stakeholder, the NASA Astrobiology Institute, represented by Mary Voytek, who provided critical feedback on the scientific milestones of the project.