SQR-054: Moving RSP Interactive Notebook containers to conda

  • Adam Thornton

Latest Revision: 2021-03-03

We have typically used pip to install supplementary software into the “stack” environment, and maintained a separate “system” environment. This has grown increasingly fragile, and now that the rubinenv work has landed in the stack containers, we can use conda and avoid a great deal of pollution of the stack.

1   The Status Quo

Historically, the conda tool has been unable to solve the Science Pipelines stack environment. This has meant that we have had no choice but to add packages to the stack with pip. Further, it has historically been the case that the environment available in the stack has often lagged far behind the current releases of its packages, which in past years caused us trouble getting JupyterLab and its dependencies working.

2   The New rubinenv World

However, as of late January 2021, the rubinenv conda environment has landed and is available in weekly builds of the stack. It does (eventually) solve with conda. Therefore, since the Science Pipelines stack environment can be extended with conda, we should be doing as much addition to it as possible with conda, reserving pip for items that are not packaged in conda at all (or, for instance, if we temporarily need something ahead of a released version but available on GitHub).
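In practice this means preferring conda install for supplementary packages and reserving pip for the exceptions. The package names, channel, and placeholders below are illustrative, not our actual package list:

```shell
# Prefer conda (here from conda-forge, which rubinenv itself uses) for
# anything that is packaged there:
conda install -c conda-forge jupyterlab jupyterhub

# Fall back to pip only for packages conda does not carry at all, or for
# a not-yet-released version needed straight from GitHub:
pip install git+https://github.com/<org>/<repo>.git@<ref>
```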

2.1   One environment or two?

There are arguments to be made that we should preserve the stack environment intact, and instead of adding anything to it, clone the environment and add the Jupyter and Jupyter-adjacent packages to the clone.

This works fine. However, because the container uses an overlay filesystem to build additional layers, all the hardlinks that conda installs to save space turn into copies, so adding a clone of the scipipe-env environment increases the container size by about 60%. This may not be an acceptable trade-off, given that it is not at all clear that the interactive analysis environment should have the pipeline batch environment available.

2.2   Goodbye to “system” python

However, once we have moved to using conda for maintenance of the stack-based environment, it really makes no sense to continue with the same system python setup we had.

In order to make the system python work at all, we needed to duplicate a large amount of functionality from within the stack environment, most notably a C compiler, CMake, a modern git, and git-lfs.

To achieve parity between the environments, we would typically do something like the following in the container build:

    pip install <stuff>              # into the "system" python
    <activate the stack environment>
    pip install <stuff>              # into the stack python

This obviously does not work if the “system” python uses pip and the “stack” python uses conda.

There are also issues such as one found on 2 February 2021, when the “system” python was at 3.6.8; as of that date, no Python 3.8+ for CentOS 7 was available via EPEL. In order to get an IPython current enough to support current jedi, we would therefore have had to retool the build to use Python 3.8 from an SCL. Thus, doing nothing and maintaining the status quo is really not an option: it would require at least as much effort as switching to the proposed solution below.

3   The Proposal

The obvious solution is to get rid of the “system” python. At present, all of the additional components we require work quite nicely with the Python environment within the stack. The only actual change (rather than supplementation) we have to do to the things packaged with the stack is a downgrade of dask, and that is simply because dask 2021.1 does not play well with the Kubernetes cluster driver, but 2020.12 does. Once dask addresses this (as surely they must), then we can remove that downgrade.
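In the container build, such a downgrade can be expressed as an explicit conda version spec. The exact version strings below are illustrative:

```shell
# Hold dask at the 2020.12 series until the Kubernetes cluster-driver
# regression in 2021.1 is fixed upstream (version specs illustrative):
conda install -c conda-forge "dask=2020.12" "distributed=2020.12"
```

Removing the downgrade later is then just a matter of deleting this line from the build.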

3.1   What if the Rubin environment stops working?

If we do run into an issue where something we require conflicts with major parts of the Science Pipelines stack, all we need to do is adopt the “two environments” solution described above. While this comes with an increase in container size, it gives us the flexibility either to start from a clone of the stack environment or to start from a new environment and add just what we need; the latter is essentially what we were doing in the pip world.

3.2   In any event

We cannot feasibly maintain two python environments built around two different packaging systems. While the author is not a fan of conda, it’s what the stack uses, and the rubinenv environment actually manages to track both conda-installed packages and those installed into the conda environment with pip reasonably well (indeed, except for things installed from pip via git repositories, conda itself can install all of the pip packages, which is something we take advantage of in our pinned builds).
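For illustration, a conda env export of such an environment records pip-installed packages in a separate pip: subsection, which is how conda can track both. The names and versions below are hypothetical:

```yaml
# Hypothetical excerpt of `conda env export` output for a rubinenv-like env
name: lsst-scipipe
dependencies:
  - python=3.8.6
  - numpy=1.20.1
  - pip:
      - some-git-installed-package==0.1.0
```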

3.3   User-installed software

Allowing user-installed software will functionally be the same as it is in the current regime: pip install --user will continue to work, but users will also have the option to use conda env create or conda create --clone; their installed packages will live under their home directories. This can of course create issues as user-installed packages become incompatible with later versions of the stack; this is why we offer the “clear .local” checkbox on the JupyterHub spawner form.
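Concretely, a user has roughly these options; the environment and package names below are illustrative (note that cloning is spelled conda create --clone):

```shell
# Per-user pip install: lands under $HOME/.local, which the spawner's
# "clear .local" checkbox can wipe:
pip install --user some-package

# Personal conda environment, either cloned from the stack environment
# (the environment name here is an assumption) or built from a spec file:
conda create --name my-env --clone lsst-scipipe
conda env create --name my-other-env --file environment.yml
```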

4   Pinned “prod” builds

In the past, our “bleed” builds have been floating, using the newest versions of packages available, while, because the stack has been pinned to specific versions, we also produce a separate “prod” build, which is what is used for daily, weekly, and release images. This requires a manual reconciliation every so often of pins from bleed into prod.

That has not been an insane amount of work, but it has been getting harder over time, and indeed has not been possible (at least not without significant effort) since December 2020.

The existence of rubinenv means, however, that the stack itself is not fully pinned. See, for instance, https://lsstc.slack.com/archives/C4RKBLK33/p1612202947357200?thread_ts=1612202910.356800&cid=C4RKBLK33 , specifically Eli Rykoff’s statement “In the new conda unpinned env, it depends on what the solver hands us. And that can change from day to day as new versions are released.”

If that is the case, perhaps it’s time to drop the “prod” builds entirely, and just use conda to install our additional packages, since there does not seem to me to be any reason to be more zealous about exact pins than our upstream input container.

5   Rebuilding Historic Images

In general, we can currently overlay our own packages onto older versions of the stack. We don’t do that very often, because the time investment to do so is nontrivial and it hasn’t been requested much. In general, the stack and JupyterLab UI should work, but things like Dask and Bokeh, which are fairly tightly coupled to the underlying Python, may or may not.

In general, however, our practice has been to treat ancient images as immutable. This of course presents a problem in that, for instance, Release 13 will no longer work with our modern spawning and authentication framework. It may be worthwhile, as we approach operations, to determine how far back we want to support stack releases and spend some time modernizing the containers for release builds only, so that they all run with whatever the current launching framework at the beginning of operations is. Clearly this effort should not be made for weekly or daily builds; however, this is almost self-correcting in that daily builds are purged after about a month, and weekly builds after a year and a half. Only release builds persist forever.

When we adopt the move to the conda-based world, we will lose the ability to rebuild images older than rubinenv since the conda solve will not work. I propose we tag the final pip-based version and use that if at some point we have to rebuild a version from before rubinenv landed.
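Mechanically, preserving that fallback is just a matter of applying a durable tag to the last pip-based image; the registry and image names below are hypothetical:

```shell
# Tag and push the last pip-based container so it remains available for
# rebuilds of pre-rubinenv stack versions (names hypothetical):
docker tag example-registry/lab:latest example-registry/lab:final-pip-based
docker push example-registry/lab:final-pip-based
```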

Note that we will not be rebuilding historic images for newly-discovered security vulnerabilities in the stack packages. The RSP by design provides its users with arbitrary code execution in the Notebook Aspect, so the rest of the infrastructure already needs to be secured against the notebook. Notebook environments will be run with restricted capabilities and privileges to limit their ability to attack the hosting infrastructure.

The scope of a security vulnerability in a historic image is therefore mostly limited to compromising the user’s notebook itself. Given the types of operations users are likely to perform with historic images (reproducing old results with a fixed version of the stack, not talking to malicious Internet sites or installing new, possibly-compromised software), this is an acceptable security risk given the important scientific objective of reproducibility of old results, which requires not upgrading software that’s part of the scientific stack.

6   Conclusion

We don’t have much choice here. Modifying individual pins has become fraught with danger, and the environment in the RSP is continuing to diverge from the upstream stack. This will only get worse.

It makes no sense to try to construct a stack-equivalent Python environment with pip; if the stack uses conda then the “system” Python, if any, should too.

At the moment, we can install our additional packages quite cleanly onto the stack with conda and therefore a single-environment container built with conda is still much closer to the input Science Pipelines stack environment than what we currently get by installing our packages with pip.

Thus, it’s my contention that we should collapse everything to just using the stack Python. If we run into something in the future where we need to separate the environment that runs JupyterLab from the stack environment, we can clone the stack environment, or build a new conda environment from scratch, and run JupyterLab in that environment. Right now that does not appear to be necessary, and the more nimble stack environment, combined with the slowing churn in JupyterLab as it has matured, makes me hopeful that it will never be necessary.