Make Any Code Linter Report Incrementally

Ming
5 min read · Oct 23


Incrementally improving code quality can be challenging, especially when working with large codebases and code linters that operate at the file level.

Photo of a small code change in a large file, by Luis Gomes via Pexels

Imagine this: A developer has made a minor change to an 800-line Python script, and the linter is bombarding them with 99+ warnings, blocking their commit. This situation not only disrupts the development process but also hinders the principle of “doing one thing and one thing well” in each pull request. Additionally, it’s unfair to expect developers to clean up code they didn’t initially write.

Let’s not do that to our fellow teammates! We should only ask people to fix problems on the lines of code they touched. Put another way: you want linters to report their findings incrementally.

Writing a glue script

Luckily, this is easy to achieve. We rely on the fact that linters always print line numbers alongside their findings. The plan is intuitive: we wrap the linter with a script that filters its output down to findings on modified lines of code only.
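To make the plan concrete, here is a minimal, self-contained sketch of the filtering idea. The file name, findings, and changed line numbers below are made up for illustration; most linters print findings in a `path:line:col: message` layout:

```python
# Hypothetical linter output in the common "path:line:col: message" layout.
findings = [
    "app.py:3:1: F401 'os' imported but unused",
    "app.py:17:80: E501 line too long",
    "app.py:42:5: E722 do not use bare 'except'",
]
# Line numbers touched by the current change, however we obtain them.
changed_lines = {17, 42}

# Keep only the findings that sit on a changed line.
kept = [f for f in findings if int(f.split(":", 2)[1]) in changed_lines]
print(kept)
```

The rest of this article fills in the two real inputs: where `changed_lines` comes from (git) and where `findings` come from (the linter).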

For glue work like this, Python really shines. In Python, we’ll be using these two packages to achieve our goal:

  • To retrieve git changes, let’s use GitPython. Although you could invoke the git CLI in a subprocess and achieve the same result, allow me to stay as Pythonic as possible for pragmatism’s sake.
  • Unidiff, a library for parsing the changes emitted by GitPython. The name “unidiff” is short for “unified diff”, which is one format of diff data.

Since we will register this script as a pre-commit hook, in place of the vanilla linter itself, the script will receive a file path to the modified file from the pre-commit framework. It will receive the path to only one file at a time:

# Imports and logger initialization omitted.
if __name__ == "__main__":
    if len(sys.argv) < 2:
        logger.fatal("No CLI argument provided.")
        sys.exit(1)
    file_path = sys.argv[1]
    logger.info(f"Checking the staged changes in `{file_path}`.")

Upon execution, our script should retrieve the code changes. What counts as a “code change” actually depends on when and where you’re running the script:

  • At pre-commit time, the relevant changes will be everything in the staging area.
  • Almost certainly, your team has a CI that runs PR checks. At post-commit-but-pre-merge time, the relevant changes are the differences from your target branch (often main or master).

To handle both cases, the next part of our code looks like this:

repo = git.Repo()
# Look for staged changes first.
patches = repo.index.diff("HEAD", paths=file_path, create_patch=True)
# If no changes are staged, compare with the target branch.
# Note that we explicitly say `origin`, so that `git` knows to find
# it remotely instead of locally. The local `main` branch can be
# very out of date in practice.
if not patches:
    patches = repo.index.diff("origin/main", paths=file_path, create_patch=True)
# If there are still no changes, exit.
if not patches:
    logger.fatal("The file specified isn't modified.")
    sys.exit()

Now, patches will be a list of objects, each corresponding to one file. Since the pre-commit framework supplies only one file at a time, we can be certain this list holds exactly one item. The next task is to parse the patch from a unidiff-formatted string into objects.

Since unidiff expects a complete unified diff, including the `---`/`+++` file header that GitPython’s per-file output omits, we need to prepend such a header to the patch. The code is as follows:

patch = patches[0]
try:
    patch = PatchSet("--- /dev/null\n+++ /dev/null\n" + patch.diff.decode())[0]
except Exception as e:
    logger.error(f"Failed to parse the patch: {e}")
    sys.exit(1)

A patch can contain multiple “hunks”. A hunk is to a patch what a block is to a file. We are only interested in code that is added, not deleted: modified lines are simply lines that are deleted in one form and added in another.

added_line_numbers = []
for hunk in patch:
    for line in hunk:
        if line.is_added:
            added_line_numbers.append(line.target_line_no)
logger.info(f"{len(added_line_numbers)} lines added.")
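If you’re curious what `target_line_no` means under the hood, here is a hand-rolled sketch (on a made-up hunk, with illustrative names) that tracks target-side line numbers the way unidiff does: the hunk header declares where the “+” side starts, and only added and context lines advance that counter.

```python
import re

# A made-up hunk: the header says the target side starts at line 10.
hunk_text = """\
@@ -10,3 +10,4 @@
 unchanged line
-removed line
+replacement line
+brand new line
"""

lines = hunk_text.splitlines()
# Pull the starting line number of the "+" (target) side from the header.
target_line_no = int(re.match(r"@@ -\d+(?:,\d+)? \+(\d+)", lines[0]).group(1))

added = []
for line in lines[1:]:
    if line.startswith("+"):
        added.append(target_line_no)
        target_line_no += 1
    elif line.startswith("-"):
        pass  # deleted lines don't occupy a line on the target side
    else:
        target_line_no += 1  # context lines advance the target counter

print(added)  # → [11, 12]
```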

Before or after collecting the line numbers of new code, we can invoke the linter itself. Remember to supply all necessary arguments here. Using flake8 as our example:

result = subprocess.run(
    ["flake8", patch.b_path, "--ignore=W503,E501"] + sys.argv[2:],
    capture_output=True,
    text=True,
)
if result.returncode == 0:
    logger.info("No errors found.")
    sys.exit()
# Else, `flake8` has indicated that we have some errors in this file.
findings = result.stdout.splitlines()
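A side note on that `subprocess.run` call: with `capture_output=True` and `text=True`, the child’s stdout comes back as a string and its exit status as `returncode`. A quick stand-in (using `python -c` instead of a real linter, with a made-up finding) shows the mechanics:

```python
import subprocess
import sys

# Stand-in for a linter: print one finding, then exit non-zero.
result = subprocess.run(
    [sys.executable, "-c",
     "print('demo.py:1:1: E999 SyntaxError'); raise SystemExit(1)"],
    capture_output=True,
    text=True,
)
print(result.returncode)           # → 1
print(result.stdout.splitlines())  # → ['demo.py:1:1: E999 SyntaxError']
```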

Finally, we can parse the analysis result and filter for findings on added lines only:

is_any_finding_in_added_lines = False
for finding in findings:
    try:
        # Perhaps use a regex for linters whose output is more complex.
        code_line_number = int(finding.split(":", 2)[1])
    except Exception:
        continue
    # Only print findings whose line numbers fall inside the changed hunks.
    if code_line_number in added_line_numbers:
        print(finding)
        is_any_finding_in_added_lines = True
sys.exit(1 if is_any_finding_in_added_lines else 0)
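The `split(":", 2)` trick assumes the path itself contains no colons. A Windows drive letter breaks that assumption, which is one reason to prefer a regex for more complex linter output. A sketch of that alternative, on a made-up finding:

```python
import re

# A Windows-style path defeats the naive split: the drive colon comes first.
finding = r"C:\repo\app.py:42:5: E722 do not use bare 'except'"

naive = finding.split(":", 2)[1]  # gives "\repo\app.py" -- not a number!

# Anchoring on ":<line>:<col>: " recovers the right field.
match = re.search(r":(\d+):(\d+): ", finding)
line_number = int(match.group(1))
print(line_number)  # → 42
```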

That’s it! To use this wrapper, add the following entry to your .pre-commit-config.yaml:

repos:
  - repo: https://github.com/pycqa/flake8
    rev: 6.1.0  # or the revision you want.
    hooks:
      - id: flake8
        # We override the entry here.
        # Without overriding, the default entry would be the `flake8` binary.
        entry: flake8_wrapper.py
        additional_dependencies: [
          # In addition to your usual extensions...
          flake8-bugbear, pep8-naming,
          # ...remember to also declare these 2 packages
          # that our wrapper script requires.
          gitpython, unidiff,
        ]

and then save our script as flake8_wrapper.py inside your repo.

Summary

It’s not just for flake8 or Python codebases: this wrap-a-linter strategy also works for other programming languages and static analyzers. By breaking down the thought process behind constructing such a wrapper script, I hope you can easily adapt it to your own situation.

Photo of some hooks and (possibly) some lints, by Karolina Grabowska via Pexels

Adopting this incremental code analysis strategy can enhance the development process by reducing distractions, improving developer satisfaction, and maintaining a cleaner, more organized codebase. Python’s versatility and simplicity make it an excellent tool for creating such solutions, allowing your team to make code quality improvements with minimal disruption.


Ming

Tech writer with creative analogies. | Donate: https://ko-fi.com/mingyli