etl-cli - npm
npm
Search<br>Sign UpSign In
etl-cli<br>0.1.34 • Public • Published 2 days ago<br>Readme<br>Code Beta<br>33 Dependencies<br>0 Dependents<br>156 Versions
Command line tool for etl pipelines.
Install globally to have etl available on the command line:
npm install etl-cli -g
Generically you use etl-cli as follows:
etl [source] [target]
source can be any of the following:
.js / .ts - javascript or typescript program exporting a stream (loaded via tsx)
.json - file with json objects separated by newline (or a single json document with --json_multiline)
.csv - csv file
.xlsx - excel file, first sheet
.parquet - parquet file (row groups streamed, projection via --columns=a,b,c)
.xml - xml file (use --selector to choose the element)
http(s) or s3 link to a .csv, .json, .xlsx, .parquet or .xml file
stdin (you have to specify --source_type=json|csv|xlsx|parquet|xml since there is no extension to sniff)
directory of files (files/, emits { filename, body } records, body is lazy)
sftp directory (sftp/, walks remote directory and emits { filename, body } records)
s3 prefix (s3files// or s3://bucket/prefix, emits { filename, body } records)
database/collection/table (elastic, mysql, mssql, postgres, mongo, athena)
target can be any of the following:
.json
.csv
.parquet (requires --schema=schemafile.json with a parquet property)
directory of files (files/, writes each { filename, body } record to disk)
sftp directory (sftp/, uploads each { filename, body } record over sftp)
s3 link to either .json or .csv file (single object), or s3files// for multiple objects
database/collection/table (elastic, mysql, postgres, mongo, bigquery)
screen (default if no target specified - pretty-prints to stdout)
test (collects records into the result for programmatic / test usage)
If the target is .csv then any nested fieldnames will be flattened to a path using > (or optional --separator=x) as a separator. Structure will be determined by prescanning 100 lines by default (can be modified by --prescan=x)
Both source and target accept a path-like syntax that gets parsed into source_* / target_* properties. For example mongo/mydb/mycoll becomes source_db_name=mydb and source_collection=mycoll, elastic/myindex/mytype becomes source_index=myindex and source_indextype=mytype, and so on. The segments after the first are also exposed as source_params / target_params (an array) for sources/targets that interpret them directly (s3, sftp, files).
Command line arguments (optional)
--silent: suppress process notifications
--collect=x: collect x number of records for bulk insert/upsert
--concurrency=x: default concurrency for files/s3files/sftp targets and transforms
--target_concurrency=x: concurrency for db targets that support it (elastic default 5, bigquery default 10)
--separator: separator for flattening nested objects into csv (default >)
--prescan=x: number of rows to prescan for csv header discovery (default 100)
--transform=x: javascript transform between source and target (inline arrow function, or comma-separated module paths)
--transform_concurrency=x: concurrency for transforms
--chain=x: javascript chain between source and target (operates on the upstream stream rather than per record)
--filter=x: inline filter function, e.g. --filter="d => d.country == 'US'"
--select=a,b,c: keep only these top-level keys
--remove=a,b: drop these top-level keys
--setid=key: copy d[key] into d._id (handy before mongo/elastic targets)
--limit=n: stop emitting after n output records
--count: short-circuit the pipeline and emit [{ recordCount }] from the source's recordCount function
--proxy=x: proxy (for use with argv.getProxy()). String can contain {{random}} which gets substituted per call. If x matches a key in .etlconfig.json it is replaced with that key's value.
--source_query: query to be applied to the source
--source_query_file: use query from either JSON or .js/.ts file that exports a function (function gets argv and may return a string or json)
--schema=file : optional file with schemas for elasticsearch, bigquery or parquet
--prescan_size=x: number of records used to infer a schema when target needs one (default 1000)
--export_schema=true: short-circuit the pipeline and emit the inferred (or supplied) JSON schema as a single record
--export_glue_schema=true: emit AWS Glue / Athena column definitions derived from the schema
--replace_table: bigquery - will delete the table and recreate with same schema before inserting
--config=path: load an extra nconf-compatible json config file at runtime
--report_interval=ms: how often progress is logged (default 1000)
--highWaterMark=n: backpressure setting for the post-transform stream
--throw: throw transform/target errors instead of swallowing them
Javascript source
A typical source is a javascript file that fetches something from a web API, FTP site, database, or other remote location. The file must export a stream function that receives argv (command...