Distributed Computing Framework for Node.js
Early development stage: this project is still under early development; many necessary features are not done yet. Use it at your own risk.
A Node.js version of Spark, without Hadoop or the JVM.
You should read the tutorial first; after that, most Spark learning material applies, just use this project instead.
Async API & deferred API
Any API that takes an RDD and produces a result is async, like `max`. Any API that creates an RDD is a deferred API, which is not async, so you can chain them like this:

```js
await dcc
  .parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
  .map(v => v + 1)
  .filter(v => v % 2 === 0)
  .take(10); // take is not a deferred API but async
```
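To make the deferred/async split concrete, here is a minimal self-contained sketch of the idea (this is an illustration, not dcf's actual implementation): transformations only record work, and the async action runs the whole recorded chain at once.

```javascript
// Illustration only: a tiny lazy chain mimicking the deferred/async split.
class MiniRDD {
  constructor(data, ops = []) {
    this.data = data;
    this.ops = ops; // recorded transformations; nothing runs yet
  }
  // Deferred APIs: return a new MiniRDD immediately, no computation.
  map(f)    { return new MiniRDD(this.data, [...this.ops, arr => arr.map(f)]); }
  filter(f) { return new MiniRDD(this.data, [...this.ops, arr => arr.filter(f)]); }
  // Async API (an "action"): apply every recorded op, then produce a result.
  async take(n) {
    const result = this.ops.reduce((arr, op) => op(arr), this.data);
    return result.slice(0, n);
  }
}

(async () => {
  const out = await new MiniRDD([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
    .map(v => v + 1)
    .filter(v => v % 2 === 0)
    .take(10);
  console.log(out); // [2, 4, 6, 8, 10]
})();
```

The point of this split is that chained transformations stay cheap synchronous calls, and only the final action needs to be awaited.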
- local master.
- rdd & partition creation & release.
- map & reduce
- repartition & reduceByKey
- disk storage partitions
- file loader & saver
- export module to npm
- decompressor & compressor
- use debug module for information/error
- provide a progress bar.
- object hash (for keys) method
- storage MEMORY_OR_DISK, and use it in sort
- storage MEMORY_SER: store in memory but off the V8 heap.
- config default partition count.
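Since `reduceByKey` is on the feature list above, here is a small self-contained sketch of its semantics on plain arrays (an illustration of the concept, not dcf's partitioned implementation): values sharing a key are merged with a combine function.

```javascript
// Illustration only: reduceByKey semantics on an array of [key, value] pairs.
function reduceByKey(pairs, combine) {
  const merged = new Map();
  for (const [key, value] of pairs) {
    // First value for a key is kept as-is; later values are combined in.
    merged.set(key, merged.has(key) ? combine(merged.get(key), value) : value);
  }
  return [...merged.entries()];
}

const counts = reduceByKey(
  [['a', 1], ['b', 1], ['a', 1]],
  (x, y) => x + y,
);
console.log(counts); // [['a', 2], ['b', 1]]
```

In a distributed setting the same merge runs per partition first, which is why repartitioning and `reduceByKey` appear together in the list above.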
0.2.x: Remote mode
- distributed master
- runtime sandbox
- plugin system
- remote dependency management
- aliyun oss loader
- hdfs loader
How to use
Install from npm (shell only)

```shell
npm install -g dcf # or yarn global add dcf
```

Then you can use the command:
Install from npm (as a dependency)

```shell
npm install --save dcf # or yarn add dcf
```
Run samples & cli
Download this repo and install dependencies:

```shell
npm install # or yarn
```

Then run the samples:

```shell
npm run ts-node src/samples/tutorial-0.ts
npm run ts-node src/samples/repartition.ts
```
Run the interactive CLI: