Python Protobuf changes
Python’s Protocol Buffers code generation using protoc has had significant changes that can cause developers… “challenges”. This post summarizes my experience of these, mostly to save me from repeatedly recreating this history for myself when I forget it.
- Version change
- Generated code change
- Implementation Backends
I’ll use this summarized table of protoc and the Pypi library’s history in this post. protoc refers to the compiler that supports code generation in multiple languages; protobuf refers to the corresponding Python (runtime) library on Pypi:
| Date | protoc | protobuf | Note |
|---|---|---|---|
| 2023-Jun | v23.2 | 4.23.2 | |
| 2022-Mar | v3.20.0-rc1 | 4.0.0rc1 | Pypi library “yanked” (don’t use) |
| 2022-Jan | v3.19.4 | 3.19.4 | |
I’ll use this minimal Protocol Buffer example too:
syntax = "proto3";

message Foo {
  string id = 1;
}
Version change
Documentation: Version Support
The linked page explains the change. Previously, the protoc release (e.g. v3.19.4) matched the Pypi release (e.g. 3.19.4). Now, the protoc release (e.g. v23.2) maps to a 4.x Pypi release (i.e. 4.23.2).
Generated code
One marked difference between the Protocol Buffers generated code (“stubs”) for Python and that of other languages (Go, Java, Rust etc.) is that the Python stubs don’t define the message classes directly; the runtime library (protobuf) builds them via a metaclass. This is explained briefly in the Python Generated Code Guide:
The Python Protocol Buffers implementation is a little different from C++ and Java. In Python, the compiler only outputs code to build descriptors for the generated classes, and a Python metaclass does the real work. This document describes what you get after the metaclass has been applied.
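The pattern the guide describes can be sketched in plain Python. This is a toy illustration, not the actual protobuf machinery, and the FieldDescriptor and MessageMeta names here are made up for the example: a metaclass reads a descriptor-like attribute and fleshes the class out at creation time, loosely mimicking what GeneratedProtocolMessageType does to the near-empty generated classes.

```python
class FieldDescriptor:
    """Toy stand-in for a protobuf field descriptor."""
    def __init__(self, name, number):
        self.name = name
        self.number = number


class MessageMeta(type):
    """Toy metaclass: synthesizes class attributes from the descriptor,
    loosely mimicking how protobuf turns sparse generated classes into
    usable message classes at import time."""
    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        for field in namespace.get("DESCRIPTOR", []):
            # e.g. Foo.ID_FIELD_NUMBER == 1
            setattr(cls, f"{field.name.upper()}_FIELD_NUMBER", field.number)
        return cls


class Foo(metaclass=MessageMeta):
    DESCRIPTOR = [FieldDescriptor("id", 1)]


print(Foo.ID_FIELD_NUMBER)  # → 1
```

The generated class body carries almost no information; everything useful is derived from the descriptor when the class object is created.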
This is why, when you generate Python stubs, the result is somewhat impenetrable:
DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile(b'...')
_globals = globals()
_builder.BuildMessageAndEnumDescriptors(DESCRIPTOR, _globals)
_builder.BuildTopDescriptorsAndMessages(DESCRIPTOR, 'foo_pb2', _globals)
if _descriptor._USE_C_DESCRIPTORS == False:
  DESCRIPTOR._options = None
  DESCRIPTOR._serialized_options = b'...'
  _globals['_FOO']._serialized_start=13
  _globals['_FOO']._serialized_end=30
Whereas other languages’ stubs are more evident:
Go:
package protos
type Foo struct {
	Id string `protobuf:"bytes,1,opt,name=id,proto3" json:"id,omitempty"`
}
Java:
public interface FooOrBuilder extends
    com.google.protobuf.MessageOrBuilder {
  /**
   * <code>string id = 1;</code>
   * @return The id.
   */
  java.lang.String getId();

  /**
   * <code>string id = 1;</code>
   * @return The bytes for id.
   */
  com.google.protobuf.ByteString
      getIdBytes();
}
Fortunately, protoc will generate Python Interface (PYI) definition files for protobuf messages (only) that give you IntelliSense if you use e.g. Visual Studio Code with a Python language server (e.g. Pylance).
The documentation shows you can:
protoc \
--proto_path=${PWD} \
--python_out=pyi_out:${PWD} \
${PWD}/*.proto
But I tend to:
protoc \
--proto_path=${PWD} \
--python_out=${PWD} \
--pyi_out=${PWD} \
${PWD}/*.proto
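For the Foo message above, the emitted foo_pb2.pyi looks roughly like this (a sketch from memory; the exact shape varies by protoc version):

```python
from google.protobuf import descriptor as _descriptor
from google.protobuf import message as _message
from typing import ClassVar as _ClassVar, Optional as _Optional

DESCRIPTOR: _descriptor.FileDescriptor

class Foo(_message.Message):
    __slots__ = ["id"]
    ID_FIELD_NUMBER: _ClassVar[int]
    id: str
    def __init__(self, id: _Optional[str] = ...) -> None: ...
```

These declarations give the editor the field names and types that the runtime-built classes can’t expose statically.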
NOTE There’s an outstanding issue to add this PYI generation to gRPC.
Generated code change
The Python release notes for protoc v3.20.0-rc1 explain:
Protobuf python generated codes are simplified. Descriptors and message classes’ definitions are now dynamic created in internal/builder.py. Insertion Points for messages classes are discarded.
This protoc version is referenced because its release notes explain the change. The Pypi protobuf release corresponding to this version was “yanked” and should not be used.
Using e.g. protoc v23.2 generates:
from google.protobuf import descriptor as _descriptor
from google.protobuf import descriptor_pool as _descriptor_pool
from google.protobuf import symbol_database as _symbol_database
from google.protobuf.internal import builder as _builder
_sym_db = _symbol_database.Default()
DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile(b'...')
_globals = globals()
_builder.BuildMessageAndEnumDescriptors(DESCRIPTOR, _globals)
_builder.BuildTopDescriptorsAndMessages(DESCRIPTOR, 'foo_pb2', _globals)
if _descriptor._USE_C_DESCRIPTORS == False:
  DESCRIPTOR._options = None
  _globals['_FOO']._serialized_start=13
  _globals['_FOO']._serialized_end=18
Using e.g. protoc v3.19.4 generates:
from google.protobuf import descriptor as _descriptor
from google.protobuf import message as _message
from google.protobuf import reflection as _reflection
from google.protobuf import symbol_database as _symbol_database
_sym_db = _symbol_database.Default()
DESCRIPTOR = _descriptor.FileDescriptor(
  name='foo.proto',
  package='',
  syntax='proto3',
  serialized_options=None,
  create_key=_descriptor._internal_create_key,
  serialized_pb=b'...'
)

_FOO = _descriptor.Descriptor(
  name='Foo',
  full_name='Foo',
  filename=None,
  file=DESCRIPTOR,
  containing_type=None,
  create_key=_descriptor._internal_create_key,
  fields=[
  ],
  extensions=[
  ],
  nested_types=[],
  enum_types=[
  ],
  serialized_options=None,
  is_extendable=False,
  syntax='proto3',
  extension_ranges=[],
  oneofs=[
  ],
  serialized_start=13,
  serialized_end=18,
)
DESCRIPTOR.message_types_by_name['Foo'] = _FOO
_sym_db.RegisterFileDescriptor(DESCRIPTOR)
Foo = _reflection.GeneratedProtocolMessageType('Foo', (_message.Message,), {
  'DESCRIPTOR' : _FOO,
  '__module__' : 'foo_pb2'
})
_sym_db.RegisterMessage(Foo)
Implementation Backends
There are 3 different implementations of Python protobuf. The Pypi library protobuf includes all three, and you can choose which implementation to use with the PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION environment variable.
The default implementation is now upb (technically μpb, written in C). The previous implementation was cpp (written in … wait for it… C++) but this is deprecated. However, it is useful for Sharing Messages Between Python and C++. There’s a pure Python implementation too, though this is not recommended.
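Selecting a backend looks like this (a minimal sketch; the variable must be set before google.protobuf is first imported anywhere in the process):

```python
import os

# Valid values are "upb" (the default), "cpp" (deprecated) and
# "python" (pure Python; slow, not recommended). Setting this after
# google.protobuf has been imported has no effect.
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

# Subsequent imports use the selected backend; you can confirm with:
#   from google.protobuf.internal import api_implementation
#   api_implementation.Type()  # e.g. "python"
```

Setting the variable in the shell (`PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python python app.py`) achieves the same thing without touching the code.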