
Data Science Package for Python


WarehousePG provides a collection of data science-related Python modules that you can use with the WarehousePG PL/Python language. This section lists the modules in the package and describes how to install and uninstall it.

For information about the WarehousePG PL/Python Language, see WarehousePG PL/Python Language Extension.


Data Science Package for Python 3.11 Modules

The following table lists the modules that are provided in the Data Science Package for Python 3.11.

| Module Name | Description/Used For |
|---|---|
| absl-py | Abseil Python Common Libraries |
| arviz | Exploratory analysis of Bayesian models |
| astor | Read/rewrite/write Python ASTs |
| astunparse | An AST unparser for Python |
| autograd | Efficiently computes derivatives of NumPy code |
| autograd-gamma | Autograd-compatible approximations to the derivatives of the Gamma family of functions |
| backports.csv | Backport of Python 3 csv module |
| beautifulsoup4 | Screen-scraping library |
| blis | The Blis BLAS-like linear algebra library, as a self-contained C-extension |
| cachetools | Extensible memoizing collections and decorators |
| catalogue | Super lightweight function registries for your library |
| catboost | A high-performance open source library for gradient boosting on decision trees |
| certifi | Python package for providing Mozilla's CA Bundle |
| cffi | Foreign Function Interface for Python calling C code |
| cftime | Time-handling functionality from netcdf4-python |
| charset-normalizer | The Real First Universal Charset Detector. Open, modern and actively maintained alternative to Chardet. |
| cheroot | Highly-optimized, pure-python HTTP server |
| CherryPy | Object-Oriented HTTP framework |
| click | Composable command line interface toolkit |
| convertdate | Converts between Gregorian dates and other calendar systems |
| cryptography | Cryptographic recipes and primitives for Python developers |
| cycler | Composable style cycles |
| cymem | Manage calls to calloc/free through Cython |
| Cython | The Cython compiler for writing C extensions for the Python language |
| datasets | HuggingFace community-driven open-source library of datasets |
| deprecat | Python @deprecat decorator to deprecate old Python classes, functions, or methods |
| dill | Serialize all of Python |
| fastprogress | A nested progress bar with plotting options for fastai |
| feedparser | Universal feed parser, handles RSS 0.9x, RSS 1.0, RSS 2.0, CDF, Atom 0.3, and Atom 1.0 feeds |
| filelock | A platform independent file lock |
| flatbuffers | The FlatBuffers serialization format for Python |
| fonttools | Tools to manipulate font files |
| formulaic | An implementation of Wilkinson formulas |
| funcy | Fancy and practical functional tools |
| future | Clean single-source support for Python 3 and 2 |
| gast | Python AST that abstracts the underlying Python version |
| gensim | Python framework for fast Vector Space Modelling |
| gluonts | GluonTS is a Python toolkit for probabilistic time series modeling, built around MXNet |
| google-auth | Google Authentication Library |
| google-auth-oauthlib | oauthlib integration for the Google Authentication Library |
| google-pasta | pasta is an AST-based Python refactoring library |
| graphviz | Simple Python interface for Graphviz |
| greenlet | Lightweight in-process concurrent programming |
| grpcio | HTTP/2-based RPC framework |
| h5py | Read and write HDF5 files from Python |
| hijri-converter | Accurate Hijri-Gregorian dates converter based on the Umm al-Qura calendar |
| holidays | Generate and work with holidays in Python |
| idna | Internationalized Domain Names in Applications (IDNA) |
| importlib-metadata | Read metadata from Python packages |
| InstructorEmbedding | Text embedding tool |
| interface-meta | Provides a convenient way to expose an extensible API with enforced method signatures and consistent documentation |
| jaraco.classes | Utility functions for Python class constructs |
| jaraco.collections | Collection objects similar to those in stdlib by jaraco |
| jaraco.context | Context managers by jaraco |
| jaraco.functools | Functools like those found in stdlib |
| jaraco.text | Module for text manipulation |
| Jinja2 | A very fast and expressive template engine |
| joblib | Lightweight pipelining with Python functions |
| keras | Deep learning for humans; an implementation of the Keras API that uses TensorFlow as a backend |
| Keras-Preprocessing | Easy data preprocessing and data augmentation for deep learning models |
| kiwisolver | A fast implementation of the Cassowary constraint solver |
| korean-lunar-calendar | Korean Lunar Calendar |
| langcodes | Tools for labeling human languages with IETF language tags |
| libclang | Clang Python Bindings, mirrored from the official LLVM repo |
| lifelines | Survival analysis in Python, including Kaplan Meier, Nelson Aalen and regression |
| lime | Local Interpretable Model-Agnostic Explanations for machine learning classifiers |
| llvmlite | Lightweight wrapper around basic LLVM functionality |
| lxml | Powerful and Pythonic XML processing library combining libxml2/libxslt with the ElementTree API |
| Markdown | Python implementation of Markdown |
| MarkupSafe | Safely add untrusted strings to HTML/XML markup |
| matplotlib | Python plotting package |
| more-itertools | More routines for operating on iterables, beyond itertools |
| murmurhash | Cython bindings for MurmurHash |
| mxnet | An ultra-scalable deep learning framework |
| mysqlclient | Python interface to MySQL |
| netCDF4 | Provides an object-oriented Python interface to the netCDF version 4 library |
| nltk | Natural language toolkit |
| numba | Compiling Python code using LLVM |
| numexpr | Fast numerical expression evaluator for NumPy |
| numpy | Scientific computing |
| oauthlib | A generic, spec-compliant, thorough implementation of the OAuth request-signing logic |
| opt-einsum | Optimizing NumPy's einsum function |
| orjson | Fast, correct Python JSON library supporting dataclasses, datetimes, and numpy |
| packaging | Core utilities for Python packages |
| pandas | Data analysis |
| pathy | pathlib.Path subclasses for local and cloud bucket storage |
| patsy | Package for describing statistical models and for building design matrices |
| Pattern | Web mining module for Python |
| pdfminer.six | PDF parser and analyzer |
| Pillow | Python Imaging Library |
| pmdarima | Python's forecast::auto.arima equivalent |
| portend | TCP port monitoring and discovery |
| preshed | Cython hash table that trusts the keys are pre-hashed |
| prophet | Automatic Forecasting Procedure |
| protobuf | Protocol buffers |
| psycopg2 | PostgreSQL database adapter for Python |
| pyasn1 | ASN.1 types and codecs |
| pyasn1-modules | Collection of ASN.1-based protocol modules for pyasn1 |
| pycparser | C parser in Python |
| pydantic | Data validation and settings management using Python type hints |
| pyLDAvis | Interactive topic model visualization |
| pymc3 | Statistical modeling and probabilistic machine learning |
| PyMeeus | Python implementation of Jean Meeus astronomical routines |
| pyparsing | Python parsing |
| python-dateutil | Extensions to the standard Python datetime module |
| python-docx | Create and update Microsoft Word .docx files |
| PyTorch | Tensors and dynamic neural networks in Python with strong GPU acceleration |
| pytz | World timezone definitions, modern and historical |
| PyXB-X | Generates Python code for classes that correspond to data structures defined by XMLSchema |
| regex | Alternative regular expression module, to replace re |
| requests | HTTP library |
| requests-oauthlib | OAuthlib authentication support for Requests |
| rouge | Full Python ROUGE Score Implementation (not a wrapper) |
| rsa | Pure-Python RSA implementation |
| sacrebleu | Hassle-free computation of shareable, comparable, and reproducible BLEU, chrF, and TER scores |
| scikit-learn | Machine learning, data mining, and analysis |
| scipy | Scientific computing |
| semver | Python helper for Semantic Versioning |
| sentence_transformers | Multilingual Sentence, Paragraph, and Image Embeddings using BERT & Co. |
| sgmllib3k | Py3k port of sgmllib |
| shap | A unified approach to explain the output of any machine learning model |
| six | Python 2 and 3 compatibility library |
| sklearn | A set of Python modules for machine learning and data mining |
| smart-open | Utilities for streaming large files (S3, HDFS, gzip, bz2, and so forth) |
| soupsieve | A modern CSS selector implementation for Beautiful Soup |
| spacy | Large scale natural language processing |
| spacy-legacy | Legacy registered functions for spaCy backwards compatibility |
| spacy-loggers | Logging utilities for spaCy |
| spectrum | Spectrum Analysis Tools |
| SQLAlchemy | Database Abstraction Library |
| srsly | Modern high-performance serialization utilities for Python |
| statsmodels | Statistical modeling |
| tempora | Objects and routines pertaining to date and time |
| tensorboard | TensorBoard lets you watch Tensors Flow |
| tensorboard-data-server | Fast data loading for TensorBoard |
| tensorboard-plugin-wit | What-If Tool TensorBoard plugin |
| tensorflow | Numerical computation using data flow graphs |
| tensorflow-estimator | High-level TensorFlow Estimator API |
| tensorflow-io-gcs-filesystem | TensorFlow IO |
| termcolor | Simple termcolor wrapper |
| Theano-PyMC | Fork of Theano maintained by the PyMC developers |
| thinc | Practical Machine Learning for NLP |
| threadpoolctl | Python helpers to limit the number of threads used in the thread pools of common native libraries used for scientific computing and data science |
| toolz | List processing tools and functional utilities |
| tqdm | Fast, extensible progress meter |
| transformers | State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow |
| tslearn | A machine learning toolkit dedicated to time-series data |
| typer | Typer, build great CLIs. Easy to code. Based on Python type hints |
| typing_extensions | Backported and Experimental Type Hints for Python 3.7+ |
| urllib3 | HTTP library with thread-safe connection pooling, file post, and more |
| wasabi | Lightweight console printing and formatting toolkit |
| Werkzeug | Comprehensive WSGI web application library |
| wrapt | Module for decorators, wrappers and monkey patching |
| xarray | N-D labeled arrays and datasets in Python |
| xarray-einstats | Stats, linear algebra and einops for xarray |
| xgboost | Gradient boosting, classifying, ranking |
| xmltodict | Makes working with XML feel like you are working with JSON |
| zc.lockfile | Basic inter-process locks |
| zipp | Backport of pathlib-compatible object wrapper for zip files |
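
The names in the table are package names, and the name you import can differ (for example, scikit-learn is imported as sklearn and beautifulsoup4 as bs4). The following is a minimal sketch, not taken from the WarehousePG documentation, of how one might check which modules can be located in a given Python environment. The helper name missing_modules and the list of names are hypothetical; the same logic could presumably be placed in the body of a PL/Python function to check the environment that the database uses.

```python
from importlib import util


def missing_modules(names):
    """Return the import names from `names` that cannot be located."""
    missing = []
    for name in names:
        try:
            # find_spec locates a module without importing it.
            found = util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised for some dotted names or modules with a broken spec.
            found = False
        if not found:
            missing.append(name)
    return missing


# Hypothetical selection; use import names, not package names.
wanted = ["numpy", "pandas", "sklearn", "xgboost"]
print(missing_modules(wanted))
```

An empty list means that every requested module was found on the current module search path.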

Installing a Data Science Package for Python

Before you install a Data Science Package for Python, make sure that your WarehousePG cluster is running, that you have sourced greenplum_path.sh, and that the $COORDINATOR_DATA_DIRECTORY and $GPHOME environment variables are set.
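
As an illustration only, the environment-variable part of this checklist could be verified with a short script such as the sketch below. The function name unset_variables is hypothetical, and the script does not check whether the cluster is running or whether greenplum_path.sh has been sourced.

```python
import os

# Variables that the prerequisites above say must be set.
REQUIRED_VARS = ("COORDINATOR_DATA_DIRECTORY", "GPHOME")


def unset_variables(env=None):
    """Return the required variables that are missing or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = unset_variables()
if missing:
    print("Not set:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```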

Note: The PyMC3 module depends on Tk. If you want to use PyMC3, you must install the tk OS package on every node in your cluster. For example:

$ sudo yum install tk
  1. Locate the Data Science Package for Python that you built or downloaded.

  2. Copy the package to the WarehousePG coordinator host.

  3. Install the package.

  4. Restart WarehousePG. You must re-source greenplum_path.sh before restarting your WarehousePG cluster:

    $ source /usr/local/greenplum-db/greenplum_path.sh
    $ gpstop -r

The Data Science Package for Python modules are installed in the following directory:

$GPHOME/ext/DataSciencePython/lib/python3.11/site-packages/
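
To confirm that a Python process can see this directory, one option is to build the path from $GPHOME and compare it with the module search path. The sketch below assumes a POSIX host; the helper names are hypothetical, and the fallback value /usr/local/greenplum-db is only the example location used in the commands in this section.

```python
import os
import sys


def data_science_site_packages(gphome):
    """Build the module directory documented above from a GPHOME value."""
    return os.path.join(
        gphome, "ext", "DataSciencePython", "lib", "python3.11", "site-packages"
    )


def is_on_search_path(directory, search_path=None):
    """Return True if `directory` appears in the module search path."""
    search_path = sys.path if search_path is None else search_path
    target = os.path.normpath(directory)
    return any(os.path.normpath(entry) == target for entry in search_path if entry)


gphome = os.environ.get("GPHOME", "/usr/local/greenplum-db")  # example fallback
directory = data_science_site_packages(gphome)
print(directory, is_on_search_path(directory))
```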

Uninstalling a Data Science Package for Python

The uninstall command removes the Data Science Package for Python modules from your WarehousePG cluster. It also restores the PYTHONPATH, PATH, and LD_LIBRARY_PATH environment variables in your greenplum_path.sh file to their pre-installation values.

Re-source greenplum_path.sh and restart WarehousePG after you remove the Data Science Package for Python:

$ . /usr/local/greenplum-db/greenplum_path.sh
$ gpstop -r

Note: After you uninstall a Data Science Package for Python from your WarehousePG cluster, any user-defined functions (UDFs) that import Python modules installed with this package return an error.
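
One way to make such failures easier to diagnose is to route imports through a small guard that reports which module is missing. This is a sketch under assumptions, not a WarehousePG feature: the helper name require_module and the wording of the message are hypothetical, and the code is plain Python that would need to be adapted to the body of a PL/Python function.

```python
import importlib


def require_module(name):
    """Import `name`, raising a descriptive error if it is unavailable."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise RuntimeError(
            f"Python module '{name}' is unavailable; "
            "is the Data Science Package for Python installed?"
        ) from exc


# A standard-library module is used here so the example runs anywhere.
json_module = require_module("json")
print(json_module.dumps({"ok": True}))
```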