Details
-
Bug
-
Status: Open
-
P2
-
Resolution: Unresolved
-
2.37.0
-
None
-
None
Description
I am reading data from a parquet and one of the columns is a Nullable Integer (https://pandas.pydata.org/docs/user_guide/integer_na.html#integer-na)
Not 100% sure I correctly declared it:
import typing from typing import Dict, Iterable, List, Optional import apache_beam as beam from apache_beam.options.pipeline_options import PipelineOptions class Record(typing.NamedTuple): port: Optional[int] #port: str recFields=set([i for i in Record.__dict__.keys() if i[:1] != '_']) beam.coders.registry.register_coder(Record,beam.coders.RowCoder) def extractDF(tuple): df=tuple[1].to_pandas() print(type(df.port.dtype)) return df input_patterns = ['data/*.parquet'] #local runner options = PipelineOptions(flags=[], type_check_additional='all') def toRecords(df): #df["port"]=None return df.to_dict('records') with beam.Pipeline(options=options) as pipeline: lines = (pipeline | 'Create file patterns' >> beam.Create(input_patterns) | 'Read Parquet files' >> beam.io.ReadAllFromParquetBatched(columns=recFields,with_filename=True) | 'Extract DF' >> beam.Map(extractDF ) | 'To dictionaries' >> beam.FlatMap(toRecords) | 'ToRows' >> beam.Map(lambda x: Record(**x)).with_output_types(Record) | "print">> beam.Map(print))
This fails with an type error.
When I uncomment the line in toRecords to set everything to None it works fine.