Commits


ptaylor authored and Wes McKinney committed 5598d2f4257
ARROW-2828: [JS] Refactor Data, Vectors, Visitor, Typings, build, tests, dependencies It's the big one; The Great ArrowJS Refactor of 2018. Thanks for bearing with me through yet another huge PR. [Check out this sweet gif](https://user-images.githubusercontent.com/178183/50551046-19a94d00-0c30-11e9-80ed-74b9290e8c49.gif) of all the new features in action. With streaming getting to a good place, we've already started working on demos/integrations with other projects like [uber/deck.gl](https://github.com/Pessimistress/deck.gl/tree/a5940e20cb1659a44cba7839082b0803a997a12f/test/apps/arrow) :tada: ### The JIRAs In addition to everything I detail below, this PR closes the following JIRAs: * [ARROW-2828](https://issues.apache.org/jira/browse/ARROW-2828): Refactor Vector Data classes * [ARROW-2839](https://issues.apache.org/jira/browse/ARROW-2839): Support whatwg/streams in IPC reader/writer * [ARROW-2235](https://issues.apache.org/jira/browse/ARROW-2235): Add tests for IPC messages split across multiple buffers * [ARROW-3337](https://issues.apache.org/jira/browse/ARROW-3337): IPC writer doesn't serialize the dictionary of nested Vectors * [ARROW-3689](https://issues.apache.org/jira/browse/ARROW-3689): Upgrade to TS 3.1 * [ARROW-3560](https://issues.apache.org/jira/browse/ARROW-3560): Remove @std/esm * [ARROW-3561](https://issues.apache.org/jira/browse/ARROW-3561): Update ts-jest * [ARROW-2778](https://issues.apache.org/jira/browse/ARROW-2778): Add Utf8Vector.from * [ARROW-2766](https://issues.apache.org/jira/browse/ARROW-2766): Add ability to construct a Table from a list of Arrays/TypedArrays ### The stats The gulp scripts have been updated to parallelize as much as possible. These are the numbers from my Intel Core i7-8700K CPU @ 3.70GHz × 12 running Ubuntu 18.04 and node v11.6.0: ```sh $ time npm run build [22:11:04] Finished 'build' after 39 s real 0m40.341s user 4m55.428s sys 0m5.559s ``` ```sh $ npm run test:coverage =============================== Coverage summary =============================== Statements : 90.45% ( 4321/4777 ) Branches : 76.7% ( 1570/2047 ) Functions : 84.62% ( 1106/1307 ) Lines : 91.5% ( 3777/4128 ) ================================================================================ Test Suites: 21 passed, 21 total Tests: 5644 passed, 5644 total Snapshots: 0 total Time: 16.023s ``` ### The fixes * `Vector#indexOf(value)` works for all DataTypes * `Vector#set(i, value)` now works for all DataTypes * Reading from node streams is now fully zero-copy * The IPC writers now serialize dictionaries of nested Vectors correctly (ARROW-3337) * DictionaryBatches marked as `isDelta` now correctly updates the dictionaries for all Vectors that point to that dictionary, even if they were created before the delta batch arrived * A few `arrow2csv` fixes: * Ignore `stdin` if it's a TTY * Now read all the Arrow formats from `stdin` * Always show the `help` text when we don't understand the input * Proper backpressure support to play nicely with other Unix utilities like `head` and `less` * [Fixes an unfiled bug](https://github.com/trxcllnt/arrow/commit/070ec9809a9f5822d62268252d0570366ec40883) we encountered last week where JS would throw an error creating RowProxies for a Table or Struct with duplicate column names ### The upgrades * New zero-copy Message/RecordBatchReaders! * [`RecordBatchReader.from()`](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/test/unit/ipc/reader/from-inference-tests.ts#L37) will peek at the underlying bytes, and return the correct implementation based on whether the data is an Arrow File, Stream, or JSON * [`RecordBatchFileReader`](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/test/unit/ipc/reader/file-reader-tests.ts#L74) now supports random-access seek, enabling more efficient web-worker/multi-process workflows * [`RecordBatchStreamReader`](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/test/unit/ipc/reader/streams-dom-tests.ts#L119) can now read multiple tables from the same underlying socket * `MessageReader` now [guarantees/enforces](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/src/ipc/message.ts#L126) message body byte alignment (this one even surfaced bugs in [node core](https://github.com/nodejs/node/issues/24817) and the [DOM streams polyfill](https://github.com/MattiasBuelens/web-streams-polyfill/issues/3)) * New RecordBatchWriters * Adds RecordBatchJSONWriter, RecordBatchFileWriter and RecordBatchStreamWriter * Adds static `RecordBatchWriter.writeAll()` method to easily write a Table or stream of RecordBatches * Both sync and async flushes based on the WritableSink * Full integration with platform I/O primitives * We can still synchronously read JSON, Buffers, `Iterable<Buffer>`, or `AsyncIterable<Buffer>` * In node, we can now read from any [`ReadableStream`](https://nodejs.org/docs/latest/api/stream.html#stream_class_stream_readable), [`fs.FileHandle`](https://nodejs.org/docs/latest/api/fs.html#fs_class_filehandle) * In the browser, we can read from any [`ReadableStream` or `ReadableByteStream`](https://developer.mozilla.org/en-US/docs/Web/API/ReadableStream), or the [`Response`](https://developer.mozilla.org/en-US/docs/Web/API/Response) returned from the `fetch()` API. (Wrapping the [FileReader](https://developer.mozilla.org/en-US/docs/Web/API/FileReader) is still todo) * We also [accept Promises](https://github.com/Pessimistress/deck.gl/blob/a5940e20cb1659a44cba7839082b0803a997a12f/test/apps/arrow/loader.js#L20) of any of the above * New convenience methods for integrating with node or DOM streams * [`throughNode()`](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/test/unit/ipc/reader/streams-node-tests.ts#L54)/[`throughDOM()`](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/test/unit/ipc/reader/streams-dom-tests.ts#L50) * [`toReadableNodeStream()`](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/test/unit/ipc/reader/streams-node-tests.ts#L69)/[`toReadableDOMStream()`](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/test/unit/ipc/reader/streams-dom-tests.ts#L65) * [`pipe()`](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/test/unit/ipc/writer/streams-node-tests.ts#L91)/[`pipeTo()`/`pipeThrough()`](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/test/unit/ipc/writer/streams-dom-tests.ts#L92) * Generic type parameters inherited from `DataType` now flow recursively ```js const table = Table.from<{ str: Utf8, i32: Int32, bools: List<Bool> }>(data); table.get(0); // will be of type { str: string, i32: number, bools: BoolVector } ``` * New simplified [`Data` class](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/src/data.ts) * New simplified, faster `Visitor` class with support for optional, more narrow [`visitT` implementations](https://github.com/trxcllnt/arrow/blob/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/src/visitor.ts#L181) * New specialized Visitor implementations to enable runtime reflection (e.g. dynamically lookup the Vector constructor for a given DataType) * New abstract `Chunked` base class for the applicative (concat) operation * public `chunkedInst.chunks` field is the list of inner chunks * New `Column` class extends `Chunked`, combines `Field` with the chunks (provides access to the field `name` from the Schema) * `RecordBatch#concat(...batchesOrTables)` now returns a Table * Table now extends `Chunked`, so it inherits: * `Table#slice(from, to)` * `Table#concat(...batchesOrTables)` * `Table#getChildAt(i)` exists, alias of `getColumnAt(i)` * `Table#getColumn[At]()` returns a Column ### The breaking changes * All the old IPC functions are gone, but the new APIs will live for much longer * `Table#batches` is now `Table#chunks`, which it inherits from `Chunked` (maybe controversial, open to aliasing) * `Table#batchesUnion` is now just... the Table instance itself (also maybe controversial, open to aliasing) * `DataType#TType` is now `DataType#typeId` -- it should have always been this, was a typo. Easy to alias if necessary. * The complicated View classes are now gone, logic centralized as specialized [`Visitors`](https://github.com/trxcllnt/arrow/tree/b58e29bc83675583238bbb94fba2f3ebf8f1e4aa/js/src/visitor) ### The tests * **Tests no longer rely on any C++ or Java generated integration files** * Integration tests have been moved into `bin/integration.js`, and they finish much quicker * The tsconfig files have been tweaked to speed up test run time and improve the async debugging experience * A streaming `RecordBatchJSONWriter` has been implemented so we can easily debug and validate written output * The JSON results are also tested against the corresponding binary representation, similar to the integration tests * A [suite of test-data helpers](https://github.com/trxcllnt/arrow/blob/d9970bb9a6a9d80bbe07b321dc6389bccf1b0835/js/test/generate-test-data.ts) have been added to auto-generate data for validation at runtime * They produce the underlying Arrow VectorData buffers, as well as the expected plain-JS-value representation [for verification](https://github.com/trxcllnt/arrow/blob/d9970bb9a6a9d80bbe07b321dc6389bccf1b0835/js/test/unit/generated-data-tests.ts#L23) * This allows us to test all possible type configuration combinations, e.g. [all types Dictionary-encode](https://github.com/trxcllnt/arrow/blob/d9970bb9a6a9d80bbe07b321dc6389bccf1b0835/js/test/data/tables.ts#L61), all types serialize when nested, etc. * A [suite of IO test helpers](https://github.com/trxcllnt/arrow/blob/d9970bb9a6a9d80bbe07b321dc6389bccf1b0835/js/test/unit/ipc/helpers.ts#L36) has been added * We use [`memfs`](https://www.npmjs.com/package/memfs) to mock the file system, which contributes to test performance improvements * This enables us to [easily test](https://github.com/trxcllnt/arrow/blob/d9970bb9a6a9d80bbe07b321dc6389bccf1b0835/js/test/unit/ipc/reader/file-reader-tests.ts#L38) all the flavors of io primitives across node and browser environments * A vscode debugging launch configuration has been added to ease the process of contributing more tests (and because I've been asked for mine so often) ### The build * Faster * Node 11+ (needs `Symbol.asyncIterator` enabled) * Closure-compiler upgrades and build enhancements mean we can auto-generate the externs file during compilation, rather than maintaining it by hand ### Misc * Added `arrow2csv` to `js/bin/arrow2csv`, so anybody with the JS project dependencies installed can easily view a CSV-ish thing (`cat foo.arrow | js/bin/arrow2csv.js`) ### Todos * Docs/Recipes/Examples * Highlight/write more tools (like `arrow2csv`) * Flesh out the RecordBatchWriters a bit more * Gather feedback on the new RecordBatchReader APIs Author: ptaylor <paul.e.taylor@me.com> Author: Paul Taylor <paul.e.taylor@me.com> Closes #3290 from trxcllnt/js-data-refactor and squashes the following commits: 2ef150f7e <ptaylor> bind getByteWidth to the vector type 9acfaa367 <ptaylor> handle the case where collapsed Uint8Arrays fully overlap 6a97ee09f <ptaylor> perf: defer creating rowProxy on nested types, use Array instead of Object for creating Data instances 2cad76065 <ptaylor> pipe directly to stdout to ensure backpressure is preserved f006a2681 <ptaylor> ensure schema and field always have a metadata map 8dc5d2cbf <ptaylor> fix Float64 Array typings 162c7d873 <ptaylor> fix arrow2csv left-pad measurement for new bignum/decimal output 64dc01519 <ptaylor> teach closure about Symbol.toPrimitive ca0db9e2a <ptaylor> fix lint ec12cdd18 <ptaylor> add a small BigNum mixin to make working with Int64 and Decimal values a bit easier 62578b93e <ptaylor> fix bug where valueToString function would return undefined (JSON.striingify(undefined) === undefined) 4b58bde06 <ptaylor> fix visitor method overload type signatures d16541335 <ptaylor> don't print comma that includes system paths 708f1b4e7 <ptaylor> move stride to data, fix chunked slicing, remove intermediate binding and getters in favor of direct property accesses 78ecc4cfd <ptaylor> use the textencoders from the global instead of Buffer for perf testing 47f0677bf <ptaylor> perf: use a closure instead of binding 380dbc7de <ptaylor> add a single-chunk column type 6bcaad6ac <ptaylor> fix lint f7d2b2ef2 <ptaylor> add getters for the dictionary and indices of chunked dictionary vectors aaf42c8a9 <Paul Taylor> Consolidated JS data handling refactor