Details
Type: New Feature
Status: Open
Priority: P3
Resolution: Unresolved
Description
Currently we can convert between a NamedTuple type and its Schema protos using named_tuple_from_schema and named_tuple_to_schema. I'd like to introduce a system to support additional types, starting with structured types like attrs, dataclasses, and TypedDict.
I've only just started digesting the code, but this task seems pretty straightforward. For example, I think the type-to-schema code would look roughly like this:
def typing_to_runner_api(type_):
  # type: (Type) -> schema_pb2.FieldType
  structured_handler = _get_structured_handler(type_)
  if structured_handler:
    schema = None
    if hasattr(type_, 'id'):
      schema = SCHEMA_REGISTRY.get_schema_by_id(type_.id)
    if schema is None:
      fields = structured_handler.get_fields()
      type_id = str(uuid4())
      schema = schema_pb2.Schema(fields=fields, id=type_id)
      SCHEMA_REGISTRY.add(type_, schema)
    return schema_pb2.FieldType(
        row_type=schema_pb2.RowType(
            schema=schema))
The rest of the work would be in implementing a class hierarchy for working with structured types, covering operations such as getting a list of fields from an instance and instantiating a type from a list of fields. Eventually we can extend this behavior to arbitrary, unstructured types.
Going in the schema-to-type direction, we have the problem of choosing which type to use for a given schema. I believe that as long as typing_to_runner_api() has been called on our structured type in the current Python session, the type will have been added to the registry and will therefore round-trip correctly, so I think we just need a public function for registering schemas for structured types.
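To illustrate the round-trip idea, here is a rough, standalone sketch of a registry keyed by schema ID. It is not Beam's SCHEMA_REGISTRY: the TypeRegistry class and its register and get_type_by_id methods are hypothetical names, and a plain string ID stands in for the schema proto.

```python
from typing import Dict, NamedTuple, Optional
from uuid import uuid4


class TypeRegistry:
    """Hypothetical two-way mapping between schema IDs and Python types."""

    def __init__(self) -> None:
        self._type_by_id: Dict[str, type] = {}
        self._id_by_type: Dict[type, str] = {}

    def register(self, type_: type, schema_id: Optional[str] = None) -> str:
        """Register type_ and return its schema ID.

        Registering the same type twice returns the ID assigned the first time.
        """
        existing = self._id_by_type.get(type_)
        if existing is not None:
            return existing
        if schema_id is None:
            schema_id = str(uuid4())
        self._type_by_id[schema_id] = type_
        self._id_by_type[type_] = schema_id
        return schema_id

    def get_type_by_id(self, schema_id: str) -> Optional[type]:
        """Return the type registered under schema_id, or None if unknown."""
        return self._type_by_id.get(schema_id)


class Point(NamedTuple):
    x: int
    y: int


registry = TypeRegistry()
point_id = registry.register(Point)
```

A schema whose ID was never registered in the current session resolves to None here, which is the case where the schema-to-type direction would need a fallback such as generating a new type.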
bhulette, did you want to tackle this, or are you OK with me going after it?
Issue Links
- is related to
  - BEAM-12955 Add support for inferring Beam Schemas from Python protobuf types (Open)
  - BEAM-13150 Integrate TFRecord/tf.train.Example with Beam Schemas and the DataFrame API (Open)