Commits


Joris Van den Bossche authored and Wes McKinney committed a1eb81b02a5
ARROW-5220: [Python] Specified schema in from_pandas also includes the index

https://issues.apache.org/jira/browse/ARROW-5220

As I mentioned in the issue, quite a few questions came up while going down this path. Assume we start expecting the index columns to also be present in the schema, if one is specified:

- Are we OK with raising an error if the index is not in the schema but would be written as a column? And only if `preserve_index=True`, or also with `preserve_index=None` when the index is not a RangeIndex? This will break some current usage (TODO, but we can probably do it with a deprecation first).
- Should we follow the order of the columns in the schema for the index as well? (Currently the index is always appended after the other columns.) -> I think yes.
- What if an index is specified in the schema but `preserve_index=False`? -> Currently raises an error.
- What if there are multiple index levels (a MultiIndex), but only one of them is specified in the schema? (For columns, a column that is not in the schema is ignored.) -> Currently only what is in the schema is selected.
- What if the index is specified in the schema, but is actually a RangeIndex that would otherwise be serialized as metadata? -> Currently raises an error (the user can pass `preserve_index=True` to prevent this), but we could also include it as a column instead of metadata in this case.

In general, though, I think it would be good to have the rule: if a `schema` is specified, it is the single source of truth about the schema, and you can be 100% sure that the resulting table will have this exact schema (otherwise an error is raised).

Closes #5379 from jorisvandenbossche/ARROW-5220-from-pandas-schema and squashes the following commits:

ee76e67fc <Joris Van den Bossche> ARROW-5220: Specified schema in from_pandas also includes the index

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Wes McKinney <wesm+git@apache.org>