The goal of pypi-transparency is very similar to the underlying motivation for the Golang team’s Checksum Database (also built with Trillian).
Even though PyPI provides hashes of the content of the packages it hosts, the developer must trust that PyPI's data is consistent. One ambition of pypi-transparency is to provide a companion, tamperproof log of PyPI package files as a double-check of these hashes.
It is important to understand what this does (and does not) provide. There's no validation of a package's content. The only calculation is that, on first observation, a SHA-256 hash of the package's content is computed and recorded. If the package is subsequently altered, the hash will almost certainly change, and this signals to the user that the package's contents have changed. Because pypi-transparency uses a tamperproof log, it's very difficult to update the recorded hash to reflect this change. Corollary: pypi-transparency will record the hashes of packages that include malicious code.
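As a rough illustration of the "record on first observation, compare later" idea, here is a minimal sketch in Python. The function names `sha256_hex` and `has_changed` are hypothetical and are not taken from pypi-transparency; the sketch only shows the hash comparison, not the tamperproof log itself.

```python
import hashlib


def sha256_hex(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at `path` (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large wheels need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path, recorded_hex):
    """True if the file's current hash differs from the previously recorded one."""
    return sha256_hex(path) != recorded_hex
```

Note that `has_changed` says nothing about whether the original contents were trustworthy; it only detects that the bytes differ from what was first recorded.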
It is common to define a Python solution's dependencies using a requirements.txt file and to use e.g. pip to pull (recursively) the dependencies defined by requirements.txt. For convenience, pypi-transparency currently depends upon a local cache of Python packages, in part because this provides it with definitive package filenames.
For example, the following
Corresponds to the following (uniquely-defined) files for my Python 3.7 environment on a Linux 64-bit OS:
There are many download files available for most package*version combinations: not only wheels and sources, but also variants for different OSs etc.
These filenames correspond to the following PyPI URLs:
It’s non-trivial (and thus avoided by pypi-transparency) to convert package name*version combinations into filenames and URLs. Instead, pypi-transparency leverages pip download to do this heavy lifting and then uses the cached files for its processing.
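One way to populate such a local cache is sketched below in Python. It assumes pip is available to the running interpreter and uses pip's documented `--requirement` and `--dest` options; the helper names are hypothetical, and this is not necessarily how pypi-transparency itself invokes pip.

```python
import subprocess
import sys
from pathlib import Path


def pip_download_command(requirements, dest):
    """Build a `pip download` command line (hypothetical helper)."""
    return [
        sys.executable, "-m", "pip", "download",
        "--requirement", str(requirements),
        "--dest", str(dest),
    ]


def download_and_list(requirements, dest):
    """Run pip download, then return the filenames pip placed in `dest`."""
    subprocess.run(pip_download_command(requirements, dest), check=True)
    return sorted(p.name for p in Path(dest).iterdir() if p.is_file())
```

The filenames returned by `download_and_list` are the ones pip resolved for the local Python version and platform, so no name-to-filename mapping has to be reimplemented.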
Given a package file (whl, zip etc.), it’s possible to determine its PyPI URL. The PyPI URL is prefixed with https://files.pythonhosted.org/packages/. The prefix is followed by the 64-hex-digit representation of the 256-bit BLAKE2 hash of the package’s contents. This is then terminated with the filename.
PyPI shards the BLAKE2 hash into two 2-digit subdirectories followed by the remaining 60 digits (2+2+60==64), followed by the filename.
The BLAKE2 hash of the grpcio v1.23.0 package (grpcio-1.23.0-cp27-cp27mu-manylinux1_x86_64.whl) pulled by pip download to my machine is d6c365db90ec27181edf491c26aa998ae631e50cd1f04ee8d8d513a95e3937f3 and so this becomes d6/c3/65... in the package’s PyPI URL.
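The URL construction described above can be sketched in Python as follows. The helper names are hypothetical, and the sketch assumes the BLAKE2 variant is BLAKE2b with a 32-byte (256-bit) digest, which the text above does not state explicitly.

```python
import hashlib

PREFIX = "https://files.pythonhosted.org/packages/"


def blake2_256_hex(data: bytes) -> str:
    """64-hex-digit BLAKE2 digest of `data` (assumes BLAKE2b, 32-byte digest)."""
    return hashlib.blake2b(data, digest_size=32).hexdigest()


def pypi_url(filename: str, digest_hex: str) -> str:
    """Shard the digest as 2 + 2 + 60 hex digits and append the filename."""
    if len(digest_hex) != 64:
        raise ValueError("expected a 64-hex-digit digest")
    return f"{PREFIX}{digest_hex[:2]}/{digest_hex[2:4]}/{digest_hex[4:]}/{filename}"


digest = "d6c365db90ec27181edf491c26aa998ae631e50cd1f04ee8d8d513a95e3937f3"
url = pypi_url("grpcio-1.23.0-cp27-cp27mu-manylinux1_x86_64.whl", digest)
# url begins with https://files.pythonhosted.org/packages/d6/c3/65db90ec...
```

For a cached file, `blake2_256_hex(open(path, "rb").read())` would supply the digest, so the URL can be derived from the local copy alone.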