Python and Data Science .gitattributes

Configure .gitattributes for Python projects including Jupyter notebooks, data files, trained models, and scientific computing artifacts.

Detailed Explanation

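Python data science projects combine source code, Jupyter notebooks, large datasets, and trained model files. Each category needs different handling in Git: source and config files want line-ending normalization and readable diffs, notebooks and lock files produce diffs too noisy to review, and binary data and models must never be converted or diffed as text.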

Recommended Configuration

# Auto detect
* text=auto

# Python source
*.py     text diff=python
*.pyx    text diff=python
*.pxd    text diff=python
*.pyi    text diff=python

# Config / packaging
*.cfg    text
*.ini    text
*.toml   text
*.yaml   text
*.yml    text
setup.py text diff=python
pyproject.toml text

# Jupyter notebooks
*.ipynb  text -diff

# Lock files
poetry.lock   text -diff
Pipfile.lock  text -diff
requirements.txt text

# Data files (text-based)
*.csv    text
*.tsv    text
*.json   text
*.jsonl  text
*.xml    text

# Data files (binary)
*.pkl    binary
*.pickle binary
*.parquet binary
*.feather binary
*.hdf5   binary
*.h5     binary
*.npy    binary
*.npz    binary
*.arrow  binary

# Trained models
*.pt     binary
*.pth    binary
*.onnx   binary
*.pb     binary
*.tflite binary
*.joblib binary

# Python compiled
*.pyc    binary
*.pyd    binary
*.so     binary
*.egg    binary
*.whl    binary

# Images / plots
*.png    binary
*.jpg    binary
*.jpeg   binary
*.svg    text

# Shell scripts
*.sh     text eol=lf
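
When several lines match the same file, Git resolves each attribute separately and the last matching line wins. The Python sketch below approximates that resolution for a few lines taken from the configuration above. It is only an illustration: the names RULES, MACROS, and resolve are hypothetical, matching uses fnmatchcase on bare file names rather than Git's full pattern rules, and git check-attr remains the authoritative way to inspect a real repository.

```python
from fnmatch import fnmatchcase

# A few lines from the configuration above, in file order.
RULES = [
    ("*", ["text=auto"]),
    ("*.py", ["text", "diff=python"]),
    ("*.ipynb", ["text", "-diff"]),
    ("*.pkl", ["binary"]),
]

# Git's built-in "binary" macro expands to these three attributes.
MACROS = {"binary": ["-diff", "-merge", "-text"]}


def resolve(filename, rules=RULES):
    """Approximate per-attribute resolution: later matching lines override earlier ones."""
    attrs = {}
    for pattern, tokens in rules:
        if not fnmatchcase(filename, pattern):
            continue
        expanded = []
        for token in tokens:
            expanded.append(token)
            expanded.extend(MACROS.get(token, []))
        for token in expanded:
            if token.startswith("-"):
                attrs[token[1:]] = "unset"
            elif "=" in token:
                name, value = token.split("=", 1)
                attrs[name] = value
            else:
                attrs[token] = "set"
    return attrs


print(resolve("train.py"))   # {'text': 'set', 'diff': 'python'}
print(resolve("README"))     # {'text': 'auto'}
print(resolve("model.pkl"))
```

Because `* text=auto` comes first, every more specific line after it overrides the auto-detection for its own file types, which is why the catch-all belongs at the top of the file.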

Jupyter Notebooks and -diff

Jupyter .ipynb files are JSON documents with embedded outputs (base64-encoded images, HTML tables, and so on), so standard line-based diffs of notebooks are nearly unreadable. Marking them text -diff keeps line-ending normalization but tells Git not to generate a textual diff: git diff only reports that the file changed.
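
To give a rough sense of how much of a notebook's JSON can be output rather than source, here is a minimal Python sketch. It builds a tiny notebook-shaped dictionary by hand (the cell contents and the fake image bytes are made up for illustration) and compares its serialized size with and without outputs. The helper names make_notebook and strip_outputs are hypothetical and not part of Jupyter or nbformat.

```python
import base64
import json


def make_notebook(image_bytes):
    """Build a tiny nbformat-4-style notebook with one code cell and one image output."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "nbformat": 4,
        "nbformat_minor": 5,
        "metadata": {},
        "cells": [
            {
                "cell_type": "code",
                "execution_count": 1,
                "metadata": {},
                "source": ["plot(df)\n"],
                "outputs": [
                    {
                        "output_type": "display_data",
                        "metadata": {},
                        "data": {"image/png": encoded},
                    }
                ],
            }
        ],
    }


def strip_outputs(nb):
    """Return a copy with code-cell outputs and execution counts cleared."""
    stripped = json.loads(json.dumps(nb))  # deep copy via a JSON round trip
    for cell in stripped["cells"]:
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


# Placeholder bytes standing in for a rendered plot.
nb = make_notebook(b"\x89PNG" + bytes(3000))
full_size = len(json.dumps(nb))
source_only_size = len(json.dumps(strip_outputs(nb)))
print(full_size, source_only_size)
```

Every re-run of a plotting cell rewrites that encoded payload, so a one-line source change can show up as thousands of changed characters in a plain diff.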

For readable notebook diffs and merges, consider a notebook-aware tool such as nbdime:

# In .gitconfig
[diff "jupyternotebook"]
  command = git-nbdiffdriver diff
[merge "jupyternotebook"]
  driver = git-nbmergedriver merge %O %A %B %L %P

# In .gitattributes (with nbdime; replaces the *.ipynb text -diff line)
*.ipynb text diff=jupyternotebook merge=jupyternotebook

Model and Data Files

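Trained models (.pt, .onnx, .pb) and binary data files (.parquet, .hdf5) should be marked binary, which disables line-ending conversion, textual diffs, and textual merges for them. For large models, consider Git LFS, which keeps a small pointer file in the repository and stores the content itself outside the regular Git history. Lines like these, placed after (or instead of) the corresponding binary entries, route those files through LFS: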

*.pt  filter=lfs diff=lfs merge=lfs -text
*.h5  filter=lfs diff=lfs merge=lfs -text
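
Which extensions deserve LFS depends on how large the files actually are. As a rough aid, the following Python sketch scans a working tree and emits LFS attribute lines for every extension that has at least one file at or above a size threshold. The function names and the 50 MiB default are assumptions chosen for illustration, not a Git or Git LFS convention.

```python
import os

LFS_ATTRS = "filter=lfs diff=lfs merge=lfs -text"


def file_sizes(root):
    """Map each file path under root (skipping .git) to its size in bytes."""
    sizes = {}
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            sizes[path] = os.path.getsize(path)
    return sizes


def lfs_attribute_lines(sizes, threshold_bytes=50 * 1024 * 1024):
    """Return one LFS line per extension that has a file at or above the threshold."""
    extensions = set()
    for path, size in sizes.items():
        ext = os.path.splitext(path)[1]
        if ext and size >= threshold_bytes:
            extensions.add(ext)
    return [f"*{ext} {LFS_ATTRS}" for ext in sorted(extensions)]


# Example with hand-written sizes instead of a real scan:
example = {"models/net.pt": 2048, "src/train.py": 100, "data/big.h5": 4096}
for line in lfs_attribute_lines(example, threshold_bytes=1024):
    print(line)
```

For a real repository, lfs_attribute_lines(file_sizes(".")) would produce candidate lines to review before adding them to .gitattributes.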

Use Case

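Machine learning and data science teams working with Python, Jupyter notebooks, and large datasets need these attributes to handle the variety of file formats in their workflows. A proper configuration keeps noisy notebook and lock-file diffs out of reviews and prevents binary model and data files from being corrupted by line-ending conversion; pairing it with a notebook-aware merge driver such as nbdime also makes notebook merge conflicts easier to resolve.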
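
The corruption risk for model and data files comes from line-ending conversion being applied to bytes that are not text. The Python sketch below mimics that conversion on a pickled object to show the effect. It is a simplified stand-in, not Git's actual implementation: normalize_line_endings is a hypothetical helper, and with `* text=auto` Git's own heuristics usually detect binary content, so the explicit binary attribute mainly removes any reliance on that detection.

```python
import pickle


def normalize_line_endings(blob):
    """Mimic text normalization by rewriting CRLF byte pairs as LF."""
    return blob.replace(b"\r\n", b"\n")


# A payload whose raw bytes happen to contain a CRLF sequence.
payload = {"weights": b"\x00\x01\r\n\x02\x03"}
original = pickle.dumps(payload)
converted = normalize_line_endings(original)

print(len(original), len(converted))      # the converted blob is shorter
print(pickle.loads(original) == payload)  # the untouched blob round-trips
print(converted == original)              # the converted blob no longer matches
```

A file altered this way no longer matches what the training run wrote, which is exactly what marking the extension binary (or -text) rules out.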
