Commits


ptaylor authored and Brian Hulette committed 09c535c629a
ARROW-5115: [JS] Add Vector Builders and high-level stream primitives

This PR adds Vector Builder implementations for each DataType, as well as high-level stream primitives for Iterables/AsyncIterables, node streams, and DOM streams.

edit: I've created a demo that transforms a CSV file/stream of JSON rows to an Arrow table in this repository: https://github.com/trxcllnt/csv-to-arrow-js

#### Builder API

The new `Builder` class exposes an API for sequentially appending (or setting into slots that have already been allocated) arbitrary JavaScript values that will be flushed to the same underlying Data chunk. The `Builder` class also supports specifying a list of null-value sentinels: values that are interpreted to mean "null", so that null is written to the null bitmap instead of a valid element.

Similar to the existing `Vector` API, `Builder` has a static `Builder.new()` method that will return the correct `Builder` subclass instance for the supplied DataType. Since the `Builder` constructor takes an options Object, this method also takes an Object:

```typescript
import { Builder, Utf8 } from 'apache-arrow';

const utf8Builder = Builder.new({
    type: new Utf8(),
    nullValues: [null, 'n/a']
});

utf8Builder
    .append('hello')
    .append('n/a') // will be interpreted to mean `null`
    .append('world')
    .append(null);

const utf8Vector = utf8Builder.finish().toVector();

console.log(utf8Vector.toJSON());
// > ["hello", null, "world", null]
```

The `Builder` class has two methods for flushing the pending values to their underlying ArrayBuffer representations: `flush(): Data<T>` and `toVector(): Vector<T>` (`toVector()` calls `flush()` and creates a `Vector` instance from the returned Data instance).

Calling `Builder.prototype.finish()` will finalize a `Builder` instance. After this, no more values should be written to the Builder instance. This is a no-op for most types, except the `DictionaryBuilder`, which flushes its internal dictionary and writes the values to the `Dictionary` type's `dictionaryVector` field.

#### Iterable and stream APIs

Creating and using Builders directly is a bit cumbersome, so we provide some high-level streaming APIs for automatically creating builders, appending values, and flushing chunks of a certain size:

```typescript
Builder.throughIterable(options: IterableBuilderOptions<T, TNull>)
Builder.throughAsyncIterable(options: IterableBuilderOptions<T, TNull>)
Builder.throughDOM(options: BuilderTransformOptions<T, TNull>)
Builder.throughNode(options: BuilderDuplexOptions<T, TNull>)
```

#### Iterables and AsyncIterables

The static `throughIterable` and `throughAsyncIterable` methods take an `options` argument that indicates the Builder's type and null-value sentinels, and return a function which accepts an Iterable or AsyncIterable, respectively, of values to transform:

```typescript
import { Chunked, Builder, Utf8 } from 'apache-arrow';

const options = { type: new Utf8(), nullValues: [null, 'n/a'] };
const buildUtf8 = Builder.throughIterable(options);
const utf8Vector = Chunked.concat(...buildUtf8(['hello', 'n/a', 'world', null]));
```

The `options` argument can also specify a `queueingStrategy` and `highWaterMark` that control the chunking semantics:

* If the `queueingStrategy` is `"count"` (or is omitted), then the returned generator function will flush the `Builder` and yield a chunk once the number of values that have been written to the Builder reaches the value supplied for `highWaterMark`, regardless of how much memory the `Builder` has allocated.
* If the `queueingStrategy` is `"bytes"`, then the returned generator function will flush the `Builder` and yield a new chunk once the Builder's `byteLength` field reaches or exceeds the value supplied for `highWaterMark`, regardless of how many elements the `Builder` contains.
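
To make the difference between the two queueing strategies concrete, here is a minimal, dependency-free sketch of the flush rule described above. It is not the apache-arrow implementation: `chunkValues`, `ChunkOptions`, and `utf8ByteLength` are hypothetical names introduced for illustration, and the byte count is approximated as the UTF-8 length of the pending strings, whereas a real `Builder`'s `byteLength` would also account for offsets and null-bitmap buffers.

```typescript
// Hypothetical stand-in for the chunking rule; it yields plain arrays
// rather than Arrow Data chunks.
type QueueingStrategy = 'count' | 'bytes';

interface ChunkOptions {
    queueingStrategy?: QueueingStrategy; // treated as 'count' when omitted
    highWaterMark: number;
}

// Assumption: a value's size is its UTF-8 byte length; nulls contribute 0.
function utf8ByteLength(value: string | null): number {
    if (value === null) { return 0; }
    let bytes = 0;
    for (const ch of value) {
        const cp = ch.codePointAt(0)!;
        bytes += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    }
    return bytes;
}

function* chunkValues(
    source: Iterable<string | null>,
    { queueingStrategy = 'count', highWaterMark }: ChunkOptions
): IterableIterator<(string | null)[]> {
    let pending: (string | null)[] = [];
    let pendingBytes = 0;
    for (const value of source) {
        pending.push(value);
        pendingBytes += utf8ByteLength(value);
        // 'count' measures elements written; 'bytes' measures accumulated size.
        const size = queueingStrategy === 'bytes' ? pendingBytes : pending.length;
        if (size >= highWaterMark) {
            yield pending; // "flush" the pending chunk
            pending = [];
            pendingBytes = 0;
        }
    }
    // Emit whatever is left once the source is exhausted.
    if (pending.length > 0) { yield pending; }
}

const values = ['hello', null, 'world', 'arrow', 'js'];

const byCount = Array.from(chunkValues(values, { highWaterMark: 2 }));
// [['hello', null], ['world', 'arrow'], ['js']]

const byBytes = Array.from(chunkValues(values, { queueingStrategy: 'bytes', highWaterMark: 8 }));
// [['hello', null, 'world'], ['arrow', 'js']]
```

With `"count"`, every chunk except possibly the last holds exactly `highWaterMark` values. With `"bytes"`, chunk lengths vary with the size of the values, and a chunk can overshoot the mark by up to one value because the check runs after each write.
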

#### Node and DOM Streams

In addition to the Iterable transform APIs, we can also create node and DOM transform streams with similar options:

```typescript
import { Readable } from 'stream';
import { toArray } from 'ix/asynciterable/toarray';
import { Chunked, Builder, Utf8 } from 'apache-arrow';

const options = {
    type: new Utf8(),
    nullValues: [undefined, 'n/a'],
    queueingStrategy: 'bytes',
    highWaterMark: 64, // flush each chunk once 64 bytes have been written
};

const utf8Vector = Chunked.concat(await toArray(
    Readable
        .from(['hello', 'n/a', 'world', undefined])
        .pipe(Builder.throughNode(options))
));
```

#### Miscellaneous

* Updates most dependencies, updates TypeScript to v3.5.1 (and resolves #4452)
* Updates the BigInt compatibility type to use `Object.setPrototypeOf()`, yielding a 4x speedup
* Updates Int64 and Uint64 set routines to accept native `bigint` types if available
* Adds a docstring to the `_InternalEmptyPlaceholderRecordBatch` class added in #4373

Author: ptaylor <paul.e.taylor@me.com>

Closes #4476 from trxcllnt/js/data-builders and squashes the following commits:

7998d2a7f <ptaylor> add createIsValidFunction example docstring
07fa443c4 <ptaylor> remove default dictionary hash function
8b0752f34 <ptaylor> fix possible AsyncRandomAccessFile race condition retrieving filehandle size
4d2a4f0b0 <ptaylor> regenerate flatbuffer source from current format schemas
05acad83d <ptaylor> fix minor row serialization issues to be compatible with console.table()
c0f7f7bb2 <ptaylor> fix a few minor formatting issues in arrow2csv
9c3a865ab <ptaylor> ensure byteLength is calculated for offsets buffer
ba755ad6b <ptaylor> use 53 bit hash fn to further avoid collisions
86593238d <ptaylor> add test for builder iterable byte queueing strategy
13b51db5a <ptaylor> print more details about each message
f2508458c <ptaylor> use a better default dictionary builder hash function
3adf55550 <ptaylor> adds or updates most of the high-level Vector.from() methods to use the Vector Builders
5aea4f938 <ptaylor> Add more specific Int64 and Uint64 Builder tests
e974444c6 <ptaylor> fix lint
5d4b0be03 <ptaylor> ensure bitmapbufferbuilder increments and decrements _popCount appropriately
4b5375ed4 <ptaylor> remove unnecessary jsdoc
b2556b790 <ptaylor> add docstring for _InternalEmptyPlaceholderRecordBatch
5b990c364 <ptaylor> Clean up typedoc output, update typedoc to master to use typescript@3.4.5
c2673ac89 <ptaylor> add initial builder jsdoc
893c74f00 <ptaylor> ensure ListBuilder supports random insertion
dddba1a7d <ptaylor> ensure variablewidthbuilder supports random insertion
84beb68cf <ptaylor> remove some property getters, clean up
0209e9d55 <ptaylor> update to typescript 3.5
c587c3392 <ptaylor> add BufferBuilders, clean up public Builder API
86a6cd062 <ptaylor> update closure compiler dependency
d032e2dc5 <ptaylor> update dependencies
fd461632a <ptaylor> fix nan checks
26387c7d2 <ptaylor> update typescript, finish streaming builders, add comprehensive builder tests
79bfd4625 <ptaylor> fix node and dom builder streams
2b7a63c6e <ptaylor> update test types
0990344da <ptaylor> add Builder throughDOM and throughNode transform streams
15ac17c31 <ptaylor> use Object.setPrototypeOf to improve bn performance
f7081908b <ptaylor> use safe BigInt64Array constructor
f65165781 <ptaylor> update typescript, ts-jest, jest
a3072731f <ptaylor> fix readable-stream detection in firefox
4c2ef42fe <ptaylor> add the rest of the builder types
9ba996556 <ptaylor> update row type inference sanity check
f8760cbed <ptaylor> enumerate each type key individually now that they're a real thing
70377a1cc <ptaylor> add vectorname getter to chunked for completion
d19314156 <ptaylor> add helper method to return the stride for a datatype
385fee34b <ptaylor> fix typo
2f36741e2 <ptaylor> move stream methods to io folder
fb2c9a235 <ptaylor> cleanup
1398ebd97 <ptaylor> don't clone Dictionary DataType instances in Schema assign to preserve pointer to original instance
f7fe8c1a5 <ptaylor> update builder buffer padding
2b7b99228 <ptaylor> ensure union typeids buffer is an Int8Array
6e64c408e <ptaylor> ensure builder allocates aligned buffers, null bitmaps initialized to null, add Int64 builder tests
cf5f77879 <ptaylor> return signed or unsigned 53bit integers
8a5c713e9 <ptaylor> show a better error message if gulp fails
6b7a20ea9 <ptaylor> ensure 64-bit int builders use BN to check for nulls
c609266fe <ptaylor> fix bool builder, add primitive builder tests
2f423ec36 <ptaylor> fix date builder tests
6ebaa36ee <ptaylor> WIP streaming data builders
9e5e3ead1 <ptaylor> fix bool builder, add primitive builder tests
ff32cbb59 <ptaylor> fix date builder tests
738fabb2a <ptaylor> WIP streaming data builders