proof is a Python library for creating optimized, repeatable and self-documenting data analysis pipelines. proof was designed to be used with the agate data analysis library, but can be used with numpy, pandas or any other method of processing data. Important links: * Documentation: http://proof.rtfd.org * Repository: https://github.com/wireservice/proof * Issues: https://github.com/wireservice/proof/issues
Category: Python / Data Analysis | Watchers: 11 | Stars: 227 | Forks: 23 | Last update: Feb 3, 2023
Any reason the library is a package and not a single Python module?
Might make sense as a way of allowing "print blocks"
Formerly onyxfish/agate#184
Hey-lo,
I'm building a version of proof using conda for conda-forge. When possible, we try to include a link to the license file in the meta.yaml specification for the build; doing so requires that the license be indexed in an explicit MANIFEST.in file so that it gets included in the source distribution.
Also, if you could see your way to pushing version 0.4.0 to PyPI at some point, that would be excellent too. (Currently using the 0.3.0 build.)
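For reference, the MANIFEST.in entry would be a single line (the filename here is an assumption; substitute whatever the license file in the repository is actually called):

```
include COPYING
```

With that in place, `python setup.py sdist` will carry the license file into the source distribution that conda-forge builds from.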
Hello,
It would be great if there was a built-in way to declare that a process will be the last one executed. Something like:
works as expected...
analysis = proof.Analysis(initialize_data, last_process=last_process)
analysis.then(process_0)
analysis.then(process_1)
analysis.then(process_2)
analysis.then(last_process)
raises an exception
analysis = proof.Analysis(initialize_data, last_process=last_process)
analysis.then(process_0)
analysis.then(process_1)
analysis.then(process_2)
analysis.then(last_process)
analysis.then(process_3)
# Exception(Other process was added after the one specified as the last one.)
raises an exception
analysis = proof.Analysis(initialize_data, last_process=last_process)
analysis.then(process_0)
analysis.then(process_1)
analysis.then(process_2)
# Exception(It's missing the last one!)
In the last example, instead of raising an exception, the call to the then method (with the last process) could be implicit, but you know... "Explicit is better than implicit."
What do you think?
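The proposed behavior could be sketched independently of proof's internals. This is only an illustration of the feature request; the `Pipeline` class and its `_sealed` flag are hypothetical names, not proof's API:

```python
# Hypothetical sketch of the proposed last_process behavior; the class
# and attribute names are illustrative and not part of proof's API.
class Pipeline(object):
    def __init__(self, initial, last_process=None):
        self._steps = [initial]
        self._last_process = last_process
        self._sealed = False

    def then(self, func):
        # Once the designated last process has been added, refuse more steps.
        if self._sealed:
            raise Exception(
                'Other process was added after the one specified as the last one.')
        self._steps.append(func)
        if func is self._last_process:
            self._sealed = True

    def run(self, data=None):
        # Refuse to run if a last process was declared but never added.
        if self._last_process is not None and not self._sealed:
            raise Exception("It's missing the last one!")
        for step in self._steps:
            step(data)
```

Both failure modes from the examples above fall out of the same two checks: one in then() and one in run().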
This would make it easier to use Proof to load and clean your data, then to switch to exploratory analysis with another tool, such as a Jupyter notebook.
Hello,
Instead of using a function, I prefer to write a class, aiming at a better description of my analysis rules. So I defined a callable:
class MyCallable(object):
    def __init__(self):
        # proof.Analysis will look for a dunder name attribute (duck typing)
        self.__name__ = self.__class__.__name__

    def __call__(self, data):
        # analysis rules
        pass
Then, I use it in the Analysis.then method:
import proof
analysis_one = MyCallable()
myproof = proof.Analysis(initial_data)
myproof.then(analysis_one)
proof.Analysis._fingerprint only considers plain functions, not arbitrary callables.
TL;DR I got this exception: https://github.com/python/cpython/blob/master/Lib/inspect.py#L624
I think this PR could be an initial step toward a better way of handling callables in this method (_fingerprint).
P.S - I'm sorry if I broke some PR rule/template. I would be happy to discuss this point, improve this PR and write more tests if necessary.
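The idea can be sketched as a small dispatch before introspection. Note this is a hypothetical stand-in, not proof's actual _fingerprint (which hashes source text); to keep the example self-contained it hashes compiled bytecode instead:

```python
import hashlib
import inspect

def fingerprint(step):
    # Sketch only: dispatch callable instances to their __call__ method,
    # since introspection helpers that accept functions and methods raise
    # TypeError when handed an arbitrary instance (the reported exception).
    if not inspect.isroutine(step):
        step = step.__call__
    # Hash the step's compiled bytecode; proof's real implementation
    # hashes source text instead.
    return hashlib.md5(step.__code__.co_code).hexdigest()
```

With this dispatch in place, a function and a callable instance go down the same code path.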
There are often cases where multiple reporters, with different skillsets, will be working on analysis in parallel. Often the common language is a SQL database. It would be cool to be able to replace the disk-based Pickle cache with a cache that is storing the data to SQL tables. This is outside the scope of proof, but parameterizing the cache would let people write their own cache implementations.
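What "parameterizing the cache" might look like can be sketched as a two-method interface. These names are illustrative, not proof's API; proof currently hard-codes its disk-based pickle cache:

```python
import pickle
from abc import ABC, abstractmethod

class Cache(ABC):
    """Hypothetical pluggable cache interface."""

    @abstractmethod
    def get(self):
        """Return the cached data."""

    @abstractmethod
    def set(self, data):
        """Persist the data."""

class PickleFileCache(Cache):
    """The current disk-based behavior, expressed against the interface."""

    def __init__(self, path):
        self._path = path

    def get(self):
        with open(self._path, 'rb') as f:
            return pickle.load(f)

    def set(self, data):
        with open(self._path, 'wb') as f:
            pickle.dump(data, f)
```

A SQL-backed implementation would subclass Cache the same way, writing each table to database tables instead of a pickle file, which is exactly the parallel-reporters use case above.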
I think this is a proof issue?
Sample code:
import agate
import proof

def load_data(data):
    data['responses'] = agate.Table.from_csv('responses.csv')

def unweighted_totals(data):
    data['first'] = data['responses'].group_by('first')

if __name__ == '__main__':
    data_loaded = proof.Analysis(load_data)
    data_loaded.then(unweighted_totals)
    data_loaded.run()
Result:
  File "tally.py", line 31, in <module>
    data_loaded.run()
  File "/Users/tylerfisher/.virtualenvs/listeners2015/lib/python2.7/site-packages/proof/analysis.py", line 204, in run
    analysis.run(refresh=refresh, _parent_cache=self._cache)
  File "/Users/tylerfisher/.virtualenvs/listeners2015/lib/python2.7/site-packages/proof/analysis.py", line 192, in run
    local_data = _parent_cache.get()
  File "/Users/tylerfisher/.virtualenvs/listeners2015/lib/python2.7/site-packages/proof/analysis.py", line 50, in get
    return deepcopy(self._data)
  [... repeated deepcopy/_reconstruct frames in copy.py as it recurses through the table's state ...]
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/copy.py", line 174, in deepcopy
    y = copier(memo)
TypeError: cannot deepcopy this pattern object
If I remove proof from the equation:
import agate

data = {}

def load_data():
    data['responses'] = agate.Table.from_csv('responses.csv')

def unweighted_totals():
    data['first'] = data['responses'].group_by('first')

if __name__ == '__main__':
    load_data()
    unweighted_totals()
It all works, and I can do things with my grouped data.
I like the idea behind proof, but find the step chaining cumbersome in some cases. I know this isn't very constructive feedback and I'm still working out exactly which cases these are. Rather than feature requesting and making the tool more opinionated about workflow, I'm wondering if the caching logic/API could be factored out into a separate package, allowing the persisting of data and passing it between steps to work using a different task runner, such as invoke.
Factoring this out into a separate package could also parameterize this operation, one day allowing power users to address #11 by serializing to a different format, assuming their data is easily serialized as JSON (or something else).