ETL CLI – the Swiss army knife you always wanted

zjonsson1 pts0 comments

etl-cli - npm

npm

Search<br>Sign UpSign In

etl-cli<br>0.1.34 • Public • Published 2 days ago<br>Readme<br>Code Beta<br>33 Dependencies<br>0 Dependents<br>156 Versions

Command line tool for etl pipelines.

Install globally to have etl available on the command line:

npm install etl-cli -g

Generically you use etl-cli as follows:

etl [source] [target]

source can be any of the following:

.js / .ts - javascript or typescript program exporting a stream (loaded via tsx)

.json - file with json objects separated by newline (or a single json document with --json_multiline)

.csv - csv file

.xlsx - excel file, first sheet

.parquet - parquet file (row groups streamed, projection via --columns=a,b,c)

.xml - xml file (use --selector to choose the element)

http(s) or s3 link to a .csv, .json, .xlsx, .parquet or .xml file

stdin (you have to specify --source_type=json|csv|xlsx|parquet|xml since there is no extension to sniff)

directory of files (files/, emits { filename, body } records, body is lazy)

sftp directory (sftp/, walks remote directory and emits { filename, body } records)

s3 prefix (s3files// or s3://bucket/prefix, emits { filename, body } records)

database/collection/table (elastic, mysql, mssql, postgres, mongo, athena)

target can be any of the following:

.json

.csv

.parquet (requires --schema=schemafile.json with a parquet property)

directory of files (files/, writes each { filename, body } record to disk)

sftp directory (sftp/, uploads each { filename, body } record over sftp)

s3 link to either .json or .csv file (single object), or s3files// for multiple objects

database/collection/table (elastic, mysql, postgres, mongo, bigquery)

screen (default if no target specified - pretty-prints to stdout)

test (collects records into the result for programmatic / test usage)

If the target is .csv then any nested fieldnames will be flattened to a path using > (or optional --separator=x) as a separator. Structure will be determined by prescanning 100 lines by default (can be modified by --prescan=x)

Both source and target accept a path-like syntax that gets parsed into source_* / target_* properties. For example mongo/mydb/mycoll becomes source_db_name=mydb and source_collection=mycoll, elastic/myindex/mytype becomes source_index=myindex and source_indextype=mytype, and so on. The segments after the first are also exposed as source_params / target_params (an array) for sources/targets that interpret them directly (s3, sftp, files).

Command line arguments (optional)

--silent: suppress process notifications

--collect=x: collect x number of records for bulk insert/upsert

--concurrency=x: default concurrency for files/s3files/sftp targets and transforms

--target_concurrency=x: concurrency for db targets that support it (elastic default 5, bigquery default 10)

--separator: separator for flattening nested objects into csv (default >)

--prescan=x: number of rows to prescan for csv header discovery (default 100)

--transform=x: javascript transform between source and target (inline arrow function, or comma-separated module paths)

--transform_concurrency=x: concurrency for transforms

--chain=x: javascript chain between source and target (operates on the upstream stream rather than per record)

--filter=x: inline filter function, e.g. --filter="d => d.country == 'US'"

--select=a,b,c: keep only these top-level keys

--remove=a,b: drop these top-level keys

--setid=key: copy d[key] into d._id (handy before mongo/elastic targets)

--limit=n: stop emitting after n output records

--count: short-circuit the pipeline and emit [{ recordCount }] from the source's recordCount function

--proxy=x: proxy (for use with argv.getProxy()). String can contain {{random}} which gets substituted per call. If x matches a key in .etlconfig.json it is replaced with that key's value.

--source_query: query to be applied to the source

--source_query_file: use query from either JSON or .js/.ts file that exports a function (function gets argv and may return a string or json)

--schema=file : optional file with schemas for elasticsearch, bigquery or parquet

--prescan_size=x: number of records used to infer a schema when target needs one (default 1000)

--export_schema=true: short-circuit the pipeline and emit the inferred (or supplied) JSON schema as a single record

--export_glue_schema=true: emit AWS Glue / Athena column definitions derived from the schema

--replace_table: bigquery - will delete the table and recreate with same schema before inserting

--config=path: load an extra nconf-compatible json config file at runtime

--report_interval=ms: how often progress is logged (default 1000)

--highWaterMark=n: backpressure setting for the post-transform stream

--throw: throw transform/target errors instead of swallowing them

Javascript source

A typical source is a javascript file that fetches something from a web API, FTP site, database, or other remote location. The file must export a stream function that receives argv (command...

json file source target default parquet

Related Articles