Automating GitLab Artifact Cleanup with Jenkins

We automated GitLab artifact cleanup using a Python script and Jenkins, keeping recent jobs and deleting older artifacts to reclaim storage safely without affecting active pipelines.

Automating GitLab Artifact Cleanup with Jenkins
Photo by Jon Tyson / Unsplash

In our organization, we use a self-hosted GitLab instead of the SaaS version. While this gives us full control, it also means that storage management becomes our responsibility. During my first month, I was tasked with reducing disk usage on GitLab, but with a constraint: I didn’t have access to the GitLab server directly. This meant no manual checks, only API access.

To identify which repositories were consuming the most storage, I wrote a Python script to fetch project statistics from GitLab. The script collects total storage size, repository size, job artifacts size, container registry size, wiki size, LFS usage, and the last activity timestamp. Exporting the results to a CSV gave us a clear overview and helped us pinpoint the main culprit: job artifacts. These are files produced by CI/CD jobs, such as build outputs, test reports, logs, or packaged applications, which can grow uncontrollably if not cleaned up regularly.

Python Script to Collect GitLab Storage Data

import requests
import csv
from datetime import datetime
from zoneinfo import ZoneInfo

GITLAB_URL = "https://git.company.com"
PRIVATE_TOKEN = "adalahpokoknya"

headers = {"PRIVATE-TOKEN": PRIVATE_TOKEN}
all_projects = []

for page in range(1, 11):
    r = requests.get(
        f"{GITLAB_URL}/api/v4/projects",
        headers=headers,
        params={
            "statistics": True,
            "order_by": "storage_size",
            "sort": "desc",
            "per_page": 100,
            "page": page
        }
    )

    if r.status_code != 200:
        print("Error:", r.text)
        break

    data = r.json()
    if not data:
        break

    all_projects.extend(data)

with open("gitlab_storage_report.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow([
        "Project", "Namespace", "Total_Storage_MB", "Repository_MB",
        "Artifacts_MB", "Registry_MB", "Wiki_MB", "LFS_MB",
        "Visibility", "Last_Activity"
    ])

    for p in all_projects:
        stats = p.get("statistics", {})
        last_activity_raw = p.get("last_activity_at")
        if last_activity_raw:
            dt = datetime.fromisoformat(last_activity_raw)
            dt_gmt7 = dt.astimezone(ZoneInfo("Asia/Jakarta"))
            last_activity = dt_gmt7.strftime("%Y-%m-%d %H:%M:%S")
        else:
            last_activity = ""

        writer.writerow([
            p.get("name"),
            p.get("path_with_namespace"),
            round(stats.get("storage_size", 0) / (1024 * 1024), 2),
            round(stats.get("repository_size", 0) / (1024 * 1024), 2),
            round(stats.get("job_artifacts_size", 0) / (1024 * 1024), 2),
            round(stats.get("container_registry_size", 0) / (1024 * 1024), 2),
            round(stats.get("wiki_size", 0) / (1024 * 1024), 2),
            round(stats.get("lfs_objects_size", 0) / (1024 * 1024), 2),
            p.get("visibility"),
            last_activity
        ])

After generating the CSV, it became evident that most storage was consumed by artifacts. Many projects were retaining artifacts indefinitely, and old pipelines were never cleaned. To reduce risk while reclaiming storage, we decided to keep the latest 70 percent of jobs and delete the oldest 30 percent. This approach protects recent pipelines while significantly reducing storage usage.

For automation, we leveraged Jenkins, which was already integrated in our CI/CD ecosystem. The cleanup pipeline runs daily and has two stages: a dry run and execution. The dry run fetches jobs, calculates total and deletable artifact storage, and generates a log and a deletion list without actually deleting anything. The execution stage deletes artifacts using the GitLab API.

Jenkins Pipeline for Artifact Cleanup

pipeline {
    agent {
        kubernetes {
            label 'cleanup-agent'
            defaultContainer 'cleanup'
            yaml """
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: cleanup
    image: alpine:3.23.3
    command:
    - cat
    tty: true
"""
        }
    }

    parameters {
        string(name: 'PROJECT_IDS_INPUT', defaultValue: '1938,5638,2568,2569,2566,2576,2570,5302,2007', description: 'Comma-separated Project IDs')
        booleanParam(name: 'RUN_DELETE', defaultValue: false, description: 'Set true to actually delete artifacts')
    }

    environment {
        LOG_FILE = 'artifact_cleanup.log'
        TOKEN = credentials('pat-gitlab-cleanup')
    }

    triggers {
        cron('H 3 * * *')
    }

    stages {
        stage('Cleanup GitLab Artifacts - Dry Run') {
            steps {
                container('cleanup') {
                    sh '''
                    apk add --no-cache curl jq coreutils
                    echo "=== Artifact Cleanup Dry Run (Keep 70%) ===" > $LOG_FILE
                    : > delete_ids.txt
                    PROJECT_IDS=${PROJECT_IDS_INPUT:-'1938,5638,2568,2569,2566,2576,2570,5302,2007'}
                    ...
                    '''
                }
            }
        }

        stage('Cleanup GitLab Artifacts - Execute') {
            when {
                anyOf {
                    expression { return params.RUN_DELETE == true }
                    triggeredBy 'TimerTrigger'
                }
            }
            steps {
                container('cleanup') {
                    sh '''
                    apk add --no-cache curl
                    echo "=== Executing Artifact Cleanup ===" >> $LOG_FILE
                    ...
                    '''
                }
            }
        }

        stage('Archive Logs') {
            steps {
                container('cleanup') {
                    archiveArtifacts artifacts: 'artifact_cleanup.log', allowEmptyArchive: true
                }
            }
        }
    }
}

Example Output

Project 1938 - Total Jobs: 500
Keep: 350
Delete: 150

Total Artifact Storage : 2048 MB
To Be Deleted          : 820 MB
Remaining After Cleanup: 1228 MB

Global Summary:
Projects processed       : 9
Total Artifact Storage   : 10,240 MB
Scheduled for Deletion   : 4,200 MB
Remaining After Cleanup  : 6,040 MB

This pipeline allowed us to automatically clean up old artifacts, reclaiming storage without affecting active pipelines. By combining Python scripts for analysis, a clear retention policy, and Jenkins for automation, we were able to keep the GitLab instance manageable even without direct server access. The approach is safe, repeatable, and can be adapted for different retention policies per project or branch.