PyArrow interop broken after Thrift2 migration
I expect to be able to write a Parquet file with Parquet2.jl and read it with Python (Pandas + PyArrow).
Consider writing a Parquet file from Julia:
using DataFrames
using Parquet2
jldata = DataFrame("A" => 1:5, "B" => 6:10)
Parquet2.writefile("fromjulia.parquet", jldata)
and then reading it via Python/Pandas/PyArrow:
using CondaPkg
CondaPkg.add("pandas")
CondaPkg.add("pyarrow")
using PythonCall: pyimport
pd = pyimport("pandas")
pydata = pd.read_parquet("fromjulia.parquet")
This all works fine with the following package versions:
[992eb4ea] CondaPkg v0.2.18
[a93c6f00] DataFrames v1.5.0
⌃ [98572fba] Parquet2 v0.2.14
[6099a3de] PythonCall v0.9.13
However, after updating Parquet2...
[98572fba] Parquet2 v0.2.15
... the Python/Pandas/PyArrow read results in a thrift-related OSError:
ERROR: Python: OSError: Could not open Parquet input source '<Buffer>': Couldn't deserialize thrift: TProtocolException: Invalid data
Python stacktrace:
[1] pyarrow.lib.check_status
@ pyarrow/error.pxi:115
[2] pyarrow.lib.pyarrow_internal_check_status
@ pyarrow/error.pxi:144
[3] pyarrow._dataset.Fragment.physical_schema.__get__
@ pyarrow/_dataset.pyx:1345
[4] __init__
@ pyarrow.parquet.core ~/projects/_etc/julia/julia-mwe/Parquet2/thrift2/.CondaPkg/env/lib/python3.11/site-packages/pyarrow/parquet/core.py:2479
[5] read_table
@ pyarrow.parquet.core ~/projects/_etc/julia/julia-mwe/Parquet2/thrift2/.CondaPkg/env/lib/python3.11/site-packages/pyarrow/parquet/core.py:2939
[6] read
@ pandas.io.parquet ~/projects/_etc/julia/julia-mwe/Parquet2/thrift2/.CondaPkg/env/lib/python3.11/site-packages/pandas/io/parquet.py:227
[7] read_parquet
@ pandas.io.parquet ~/projects/_etc/julia/julia-mwe/Parquet2/thrift2/.CondaPkg/env/lib/python3.11/site-packages/pandas/io/parquet.py:509
Stacktrace:
[1] pythrow()
@ PythonCall ~/.julia/packages/PythonCall/1f5yE/src/err.jl:94
[2] errcheck
@ ~/.julia/packages/PythonCall/1f5yE/src/err.jl:10 [inlined]
[3] pycallargs(f::PythonCall.Py, args::PythonCall.Py)
@ PythonCall ~/.julia/packages/PythonCall/1f5yE/src/abstract/object.jl:210
[4] pycall(::PythonCall.Py, ::String, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ PythonCall ~/.julia/packages/PythonCall/1f5yE/src/abstract/object.jl:228
[5] pycall(::PythonCall.Py, ::String, ::Vararg{Any})
@ PythonCall ~/.julia/packages/PythonCall/1f5yE/src/abstract/object.jl:218
[6] (::PythonCall.Py)(::String, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ PythonCall ~/.julia/packages/PythonCall/1f5yE/src/Py.jl:341
[7] (::PythonCall.Py)(::String, ::Vararg{Any})
@ PythonCall ~/.julia/packages/PythonCall/1f5yE/src/Py.jl:341
[8] top-level scope
@ REPL[14]:1
I'm not sure if this is an issue with Parquet2.jl, Thrift2.jl, or PyArrow. I'm happy to try to help with debugging. For reference, I'm very experienced with Python and pretty green with Julia.
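As a possible first debugging step, the sketch below inspects the raw file without involving PyArrow at all. It relies only on the Parquet file layout: the file starts and ends with the 4-byte magic `PAR1`, and the 4 bytes before the trailing magic hold the little-endian length of the Thrift-encoded footer. The function name `parse_parquet_footer` and the returned dictionary keys are my own hypothetical choices, not part of any library. If the magic and length look sane for files written by both v0.2.14 and v0.2.15, comparing the two footers byte by byte might show where the Thrift encoding diverges.

```python
import struct

PARQUET_MAGIC = b"PAR1"


def parse_parquet_footer(data: bytes) -> dict:
    """Split raw Parquet file bytes into magic check, footer length, and footer bytes.

    Hypothetical helper for debugging; it does not decode the Thrift metadata,
    it only extracts the bytes that a reader would hand to its Thrift decoder.
    """
    # Smallest possible layout: leading magic + footer length + trailing magic.
    if len(data) < 12:
        raise ValueError("too short to be a Parquet file")

    magic_ok = data[:4] == PARQUET_MAGIC and data[-4:] == PARQUET_MAGIC

    # Footer length is a 4-byte little-endian unsigned int before the trailing magic.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    if footer_len > len(data) - 12:
        raise ValueError(f"footer length {footer_len} exceeds file size {len(data)}")

    # The serialized metadata sits immediately before the length field.
    footer = data[len(data) - 8 - footer_len : len(data) - 8]
    return {"magic_ok": magic_ok, "footer_len": footer_len, "footer": footer}


# Example use against the file from this report (path taken from the snippet above):
#   from pathlib import Path
#   info = parse_parquet_footer(Path("fromjulia.parquet").read_bytes())
#   print(info["magic_ok"], info["footer_len"], info["footer"][:32].hex())
```

Running this on one file written by each Parquet2 version and diffing the `footer` bytes would at least show whether the footer itself changed or whether the problem lies elsewhere in the file.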