- 15. Sep
Pachyderm is a language-agnostic and cloud infrastructure-agnostic large-scale data processing framework based on software containers. This chart can be used to deploy Pachyderm backed by the object stores of different cloud providers.
Pachyderm is a tool for production data pipelines. If you need to chain together data scraping, ingestion, cleaning, munging, wrangling, processing, modeling, and analysis in a sane way, then Pachyderm is for you. If you have an existing set of scripts that do this in an ad-hoc fashion and you're looking for a way to productionize them, Pachyderm can make this easy for you.
- Containerized: Pachyderm is built on Docker and Kubernetes. Whatever languages or libraries your pipeline needs can run on Pachyderm, which can be deployed on any cloud provider or on-premises.
- Version Control: Pachyderm version controls your data as it's processed. You can always ask the system how data has changed, see a diff, and, if something doesn't look right, revert.
- Provenance (aka data lineage): Pachyderm tracks where data comes from, keeping a record of all the code and data that created a result.
- Parallelization: Pachyderm can efficiently schedule massively parallel workloads.
- Incremental Processing: Pachyderm understands how your data has changed and is smart enough to only process the new data.
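The incremental processing idea in the list above can be illustrated with a small conceptual sketch. This is not Pachyderm's implementation. It only assumes that inputs can be identified by a content hash, and the helper names (`content_hash`, `process_incrementally`) are hypothetical.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Identify a piece of input data by the SHA-256 of its content."""
    return hashlib.sha256(data).hexdigest()


def process_incrementally(files, cache, transform):
    """Run `transform` only on inputs whose content has not been seen before.

    `files` maps path -> bytes, `cache` maps content hash -> earlier result.
    Returns the full result set plus the list of paths actually processed.
    """
    results = {}
    processed = []
    for path, data in files.items():
        key = content_hash(data)
        if key not in cache:
            cache[key] = transform(data)  # new or changed data: do the work
            processed.append(path)
        results[path] = cache[key]        # unchanged data: reuse the result
    return results, processed


cache = {}
first_commit = {"/a.txt": b"hello", "/b.txt": b"world"}
_, processed_first = process_incrementally(first_commit, cache, bytes.upper)

# A later commit adds one file; only that file is processed.
second_commit = {"/a.txt": b"hello", "/b.txt": b"world", "/c.txt": b"new"}
results, processed_second = process_incrementally(second_commit, cache, bytes.upper)

print(processed_first)
print(processed_second)
```

In this toy model the second run touches only `/c.txt`, because the other two inputs hash to values already in the cache.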
Pachyderm Open Source
Pachyderm's foundational technology is open source. This open source core is designed to enable sustainable data science workflows via a language-agnostic system for data versioning with data pipelining.
All of the data that flows into and out of a Pachyderm Pipeline stage is version controlled. You can look back to see what your training data looked like when a particular model was created or how your results have changed over time.
What Git does for code in terms of collaboration and reproducibility, Pachyderm does for your data. Collaborate on the same data with teammates and ensure that your analyses are kept in sync with the latest changes to data, or backtest models on historical states of data.
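To make the Git analogy concrete, here is a minimal sketch of data versioning as an append-only list of snapshots. It is a toy model under simplifying assumptions (whole-file snapshots held in memory), and the `DataRepo` class and its methods are hypothetical, not part of any Pachyderm API.

```python
class DataRepo:
    """Toy append-only repo: each commit is a snapshot of path -> content."""

    def __init__(self):
        self.commits = []

    def commit(self, files):
        """Store a copy of the given files and return the new commit id."""
        self.commits.append(dict(files))
        return len(self.commits) - 1

    def diff(self, old_id, new_id):
        """Report which paths were added, removed, or changed between commits."""
        old, new = self.commits[old_id], self.commits[new_id]
        return {
            "added": sorted(new.keys() - old.keys()),
            "removed": sorted(old.keys() - new.keys()),
            "changed": sorted(
                p for p in old.keys() & new.keys() if old[p] != new[p]
            ),
        }

    def checkout(self, commit_id):
        """Return the data exactly as it looked at a historical commit."""
        return dict(self.commits[commit_id])


repo = DataRepo()
c0 = repo.commit({"/train.csv": "a,b\n1,2\n"})
c1 = repo.commit({"/train.csv": "a,b\n1,2\n3,4\n", "/labels.csv": "y\n0\n"})

changes = repo.diff(c0, c1)
print(changes)

# Backtesting on a historical state means reading an old snapshot.
historical = repo.checkout(c0)
```

Because commits are never modified after they are written, reverting or backtesting amounts to reading an earlier snapshot.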
Containers for Analysis
Data scientists utilize a diverse set of tools, languages, and frameworks. Because Pachyderm utilizes software containers as the main element of data processing, data scientists and data engineers can use and combine any tools they need for a certain set of analyses or modeling.
Data scientists can develop code locally (e.g., in Jupyter) on samples of data and utilize that exact same code in a Docker image as a stage in a distributed Pachyderm pipeline. They don't need to import special libraries or add complication. They just declare what processing needs to be run on what data, and Pachyderm takes care of the details.
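As a rough illustration of declaring what processing runs on what data, the sketch below builds a pipeline specification as a Python dictionary and prints it as JSON. The field names (`pipeline`, `transform`, `input.pfs`, `glob`) and the `/pfs/...` paths follow the commonly documented Pachyderm pipeline spec, but they can differ between versions, so treat them as assumptions to verify against the documentation for your release. The pipeline name, image, script path, and repo name are hypothetical.

```python
import json


def make_pipeline_spec(name, image, cmd, input_repo, glob="/*"):
    """Build a declarative pipeline description.

    Assumed layout: which container image to run, which command to execute
    inside it, and which input repo (split by a glob pattern) feeds it.
    """
    return {
        "pipeline": {"name": name},
        "transform": {"image": image, "cmd": cmd},
        "input": {"pfs": {"repo": input_repo, "glob": glob}},
    }


spec = make_pipeline_spec(
    name="word-count",                      # hypothetical pipeline name
    image="example/word-count:1.0",         # hypothetical Docker image
    cmd=["python3", "/code/count.py", "/pfs/raw-text", "/pfs/out"],
    input_repo="raw-text",                  # hypothetical input repo
)

print(json.dumps(spec, indent=2))
```

The script inside the image needs no special library: in this sketch it only reads files from an input directory and writes files to an output directory.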
Pachyderm automatically parallelizes your analyses by providing subsets of data to multiple instances of your code. Data scientists don't have to write explicit implementations of parallelism and data sharing in their code. They can keep their code simple (even single-threaded) and let Pachyderm handle the complications of distributed processing.
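The division of labor described above can be sketched in a few lines: the user code handles one subset of data and knows nothing about parallelism, while a separate runner fans the subsets out to workers. This is a conceptual model that uses local threads in place of containers, and `count_words` and `run_parallel` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(text):
    """User code: simple, single-threaded, processes one subset of data."""
    return len(text.split())


def run_parallel(datums, user_code, workers=3):
    """Framework side: hand each subset of data to a separate worker.

    `datums` maps path -> content. Results come back keyed by the same
    paths, since `map` preserves input order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outputs = pool.map(user_code, datums.values())
        return dict(zip(datums.keys(), outputs))


datums = {
    "/a.txt": "one two",
    "/b.txt": "three",
    "/c.txt": "four five six",
}

counts = run_parallel(datums, count_words)
print(counts)
```

Changing `workers` alters how the work is spread out but not the user code or the results.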
Data scientists can even declare types of nodes that should run their analyses (e.g., GPUs). Pachyderm will make sure that the right data get processed by the right types of nodes, and it will even help you auto-scale resources as your team or workloads grow or shrink.
Not all changes to code or data produce the expected results, and it can be difficult to work out which changes to code or data produced which results, especially as data science teams grow. Pachyderm makes this much easier.
Pachyderm lets you quickly and easily understand the provenance of any result or intermediate piece of data. For example, you can quickly deduce which version of a model produced certain results and determine which training data was used to build that model. This lets data science teams iteratively build, change, and collaborate on analyses while ensuring that they can debug, maintain, and understand those analyses over time.
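One way to picture provenance is as a graph in which every result points back to the code and data versions that produced it. The sketch below walks such a graph to answer the question of which model and training data lie behind a result. It is an illustrative model only; the `provenance` function and the artifact names are hypothetical and do not reflect how Pachyderm stores lineage internally.

```python
def provenance(artifact, parents):
    """Return every upstream artifact that contributed to `artifact`.

    `parents` maps an artifact to the artifacts it was directly built from.
    """
    seen = set()
    stack = [artifact]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)  # keep walking further upstream
    return seen


# Hypothetical lineage: predictions came from a model and fresh data,
# and the model came from training data plus a version of the training code.
parents = {
    "predictions@c3": ["model@c2", "new-data@c1"],
    "model@c2": ["training-data@c1", "train.py@v4"],
}

lineage = provenance("predictions@c3", parents)
print(sorted(lineage))
```

Debugging an unexpected result then starts from the returned set: each entry is a specific version of code or data that can be inspected or compared against an earlier run.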