Making PyPI's test suite faster (blog.trailofbits.com)

125 points by rbanffy 131 days ago | 39 comments

boyd 127 days ago [-]

Throwing cores at the problem with `pytest-xdist` is typically the lowest hanging fruit, but you still hit all the paper cuts the authors mention -- collection, DB fixtures, import time, etc.

And further optimization is really hard when the CI plumbing starts to dominate. For example, the last Warehouse `test` job I checked has 43s of GitHub Actions overhead for 51s of pytest execution time (nearly half the test action time, and approaching 100% overhead).

Disclosure: Have been tinkering on a side project trying to provide 90% of these pytest optimizations automatically, but also get "time-to-first-test-failure" down to ~10 seconds (via warm runners, container snapshotting, etc.). Email in profile if anyone would like to swap notes.

cocoflunchy 127 days ago [-]

I don't understand why pytest's collection is so slow.

On our test suite (big django app) it takes about 15s to collect tests. So much that we added a util using ripgrep to find the file and pass it as an argument to pytest when using `pytest -k <testname>`.

Galanwe 127 days ago [-]

From my experience speeding up pytests with Django:

- Creating and migrating the test DB is slow. There is no shame in storing and committing a premigrated sqlite test DB generated upon release, it's often small in size and will save time for everyone.

- Stash your old migrations that nobody uses anymore.

- Use python -X importtime and paste the result in an online viewer. Sometimes moving heavy imports to functions instead of the global scope will make individual tests slower, but collection will be faster.

- Use pytest-xdist

- Disable transactions / rollback on readonly tests. Ideally you want most of your non-inserting tests to work on the migrated/preloaded features in your sqlite DB.

We can go into more detail if you want, but the pre-migrated DB + xdist alone allowed me to speed up tests on a huge project from 30m to 1m.
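
To illustrate the import point from the list above, here is a minimal sketch of moving an import from module scope into the function that needs it. `colorsys` is only a stand-in for a genuinely heavy dependency, and `to_hls` is a made-up name.

```python
def to_hls(rgb):
    """Convert an (r, g, b) tuple to (h, l, s)."""
    # Deferred import: the module is loaded on the first call, not when
    # the file is imported during test collection.
    import colorsys

    return colorsys.rgb_to_hls(*rgb)
```

The first call pays the import cost, later calls hit the `sys.modules` cache, and files that never call the function never pay it at all.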

caidan 127 days ago [-]

Agreed, the db migrations are usually the slowest part. Another way to speed this up substantially if you are using postgres and need your test database to be postgres too, is to create and maintain a template database for your tests. This database should have all migrations already run on it and be loaded with whatever general use fixtures you will need. You can then use the Django TEMPLATE setting https://docs.djangoproject.com/en/5.1/ref/settings/#template and Django will clone that database when running your tests.
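
A settings fragment along those lines might look like the sketch below. The database names `myapp` and `myapp_test_template` are placeholders, and creating and migrating the template database itself is assumed to happen elsewhere.

```python
# settings.py fragment (database names are placeholders)
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "myapp",
        "TEST": {
            # PostgreSQL-only: the test database is created from this
            # template instead of from an empty database.
            "TEMPLATE": "myapp_test_template",
        },
    }
}
```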

imp0cat 127 days ago [-]

Is there a way to use pytest-xdist and still keep the regular output?

kinow 127 days ago [-]

In their case I think they were not specifying any test path, which would cause pytest to search for tests in multiple directories.

Another thing that can slow down pytest collection and bootstrap is how fixtures are loaded. So reducing the number or scope of fixtures may help too.

boxed 127 days ago [-]

I've done some work on making pytest faster, and it's mostly a case of death by a thousand paper cuts. I wrote hammett as an experimental benchmark to compare to.

piokoch 127 days ago [-]

Ehhh, those pesky Python people, complaining and complaining. The average Spring Boot application takes 15s before it even starts checking whether the code compiled ;)

thom 127 days ago [-]

Lest we start to malign the JVM as a whole, my Clojure test suite, which includes functional tests running headless browsers against a full app hitting real Postgres databases, runs end to end in 20s.

ffsm8 127 days ago [-]

The Spring tests are generally quicker than the equivalent Python tests, so IME the JVM is mostly to blame.

How much time actually goes by after you click "run test" (or run the equivalent cli command) until the test finished running?

Any projects using the jvm I've ever worked on (none of which were clojure, admittedly) have always taken at least 10-15s until the pre-phases were finished and the actual test setup began

thom 127 days ago [-]

If I completely clear all cached packages maybe, but I never do that locally or in CI/CD, and that's true of Python too (but no doubting UV is faster than Maven). Clojure/JVM startup time is less than half a second, obviously that's still infinitely more than Python or a systems language but tolerable to me. First test runs after about 2s? And obviously day to day these things run instantly because they're already loaded in a REPL/IPython. Maybe unfair to compare an interpreted language to a compiled one: building an uberjar would add 10 seconds but I'd never do that during development, which is part of the selling point I guess. Either way, I don't think the JVM startup time is really a massive issue in 2025, and I feel like whatever ecosystem you're in, you can always attack these slow test suites and improve your quality of life.

esafak 127 days ago [-]

It spins up a postgres container in that 20s?

thom 127 days ago [-]

Not a container but yes, it launches a cluster at the start of a run, and copies a blank Postgres template before every relevant test.

nine_k 127 days ago [-]

One thing not mentioned here is putting your test database on a RAM disk, aka tmpfs. This significantly speeds up all DB-related tests that use transactions, fixture loading, and migrations.

In most distros, /tmp is mounted as tmpfs, but YMMV.
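
Since the mount type varies, it may be worth checking before relying on it. The helper below is a sketch that parses text in the `/proc/mounts` format (Linux-specific); the name `filesystem_type` is made up for illustration.

```python
def filesystem_type(path, mounts_text):
    """Return the fs type of the longest mount point containing `path`."""
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        inside = path == mount_point or path.startswith(prefix)
        # Prefer the most specific (longest) matching mount point.
        if inside and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type
```

On a Linux machine, `filesystem_type("/tmp", open("/proc/mounts").read())` reports whether `/tmp` is actually backed by `tmpfs`.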

qznc 127 days ago [-]

I generally try to avoid mocking completely. However, speeding up tests is an appropriate use. If someone changes the implementation the mock usually simply doesn't apply and the test still works as intended.

For example, a great speed optimization in our tests recently was to mock time.sleep.

Why do we have so many sleeps? This is testing a test framework for embedded devices where there is plenty of fiddling-then-wait-for-the-hardware.

I also mocked some file system accesses. Unit testing is about our application logic and not about Linux kernel behavior anyways.
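
A minimal sketch of the `time.sleep` mock using `unittest.mock.patch`. The polling helper `wait_for_device` is hypothetical and stands in for the fiddle-then-wait code.

```python
import time
from unittest import mock


def wait_for_device(poll, attempts=5, delay=2.0):
    """Poll until the device reports ready, sleeping between attempts."""
    for _ in range(attempts):
        if poll():
            return True
        time.sleep(delay)
    return False


def test_wait_for_device_gives_up():
    # Patching time.sleep makes the retry loop return immediately while
    # still recording how often it would have slept.
    with mock.patch("time.sleep") as fake_sleep:
        assert wait_for_device(lambda: False, attempts=3) is False
    assert fake_sleep.call_count == 3
```

The retry logic is still exercised in full; only the waiting is removed.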

NeutralForest 127 days ago [-]

Pretty good article. It's really a challenge to properly isolate DB operations during testing, so having a different instance per worker is nice. I remember trying to use different schemas (not instances), but I had a hard time isolating roles as well.

lyu07282 127 days ago [-]

It's more work, but that's one benefit of clean architecture that abstracts the persistence layer. (You can replace it with an in-memory variant.)
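
A sketch of what that abstraction can look like, with all names (`UserRepository`, `InMemoryUserRepository`, `register`) invented for illustration. Application code depends only on the protocol, so tests can pass in the in-memory variant.

```python
from typing import Protocol


class UserRepository(Protocol):
    """The persistence interface the application code depends on."""

    def add(self, name: str) -> int: ...
    def get(self, user_id: int) -> str: ...


class InMemoryUserRepository:
    """Test double that keeps rows in a dict instead of a database."""

    def __init__(self) -> None:
        self._users: dict[int, str] = {}
        self._next_id = 1

    def add(self, name: str) -> int:
        user_id = self._next_id
        self._users[user_id] = name
        self._next_id += 1
        return user_id

    def get(self, user_id: int) -> str:
        return self._users[user_id]


def register(repo: UserRepository, name: str) -> int:
    """Application logic under test: normalizes the name, then stores it."""
    return repo.add(name.strip().lower())
```

This only covers logic that lives in the application; behavior enforced by the database itself still needs a real database in the test.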

NeutralForest 127 days ago [-]

I was using https://www.postgresql.org/docs/current/ddl-rowsecurity.html and needed to check that some complex policies were working correctly so I couldn't just replace with say, SQLite.

throwme_123 127 days ago [-]

Is Trail of Bits transitioning out of "crypto"?

Imho, they are one of the best auditors out there for smart contracts. Wouldn't be surprising to see some of these talented teams find bigger markets.

woodruffw 127 days ago [-]

No; Trail of Bits has always had multiple internal groups, including an OSS engineering group that does security and performance engineering. We still do plenty of audits as a company; you can see recent work on that front here[1] :-).

Source: I run the group that produced this work.

[1]: https://github.com/trailofbits/publications

frogsRnice 127 days ago [-]

You all do amazing work, hope I can boast the same someday - or even 50% of it ;)

Seriously, you are my heroes!

bsamuels 126 days ago [-]

In addition to what Will posted, published reports for blockchain projects tend to be skewed compared to our other groups.

Blockchain clients tend to want to publish the report, but that isn't true for our business lines/projects/clients that are more interesting to HN's audience.

frogsRnice 127 days ago [-]

IMO it's not just crypto; a lot of their reports are enlightening to read.

ustad 127 days ago [-]

The article uses pytest - does anyone have similar tips when using Python's built-in unittest?

masklinn 127 days ago [-]

The sys.monitoring and import optimisation suggestions apply as-is.

If you use standard unittest discovery the third item might apply as well, though probably not to the same degree.

I don’t think unittest has any support for distribution so the xdist stuff is a no.

On the other hand, you could use unittest as the API with pytest as your test runner. Then you can also use xdist. And eventually migrate to the pytest test API because it's so much better.
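
For example, a plain `unittest.TestCase` like the sketch below (the class and test names are made up) needs no changes to run under either runner.

```python
import unittest


class SlugTests(unittest.TestCase):
    """Written against the unittest API only; no pytest imports."""

    def test_spaces_become_dashes(self):
        self.assertEqual("hello world".replace(" ", "-"), "hello-world")
```

`python -m unittest` picks this up as usual, and so does `pytest`; with pytest-xdist installed, `pytest -n auto` spreads such test cases across workers.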

kinow 127 days ago [-]

I wasn't familiar with this sys.monitoring option for coverage. Going to give it a try in my test suite. At the moment, with Docker testcontainers, a GitHub Actions test matrix for multiple Python versions, and unit + regression + integration tests, it is taking about 3-5 minutes.

darkamaul 127 days ago [-]

Warning, there is a change in coverage 7.7.0 that disables sysmon support for coverage if using branch coverage _and_ a version of Python before 3.14alpha6.

[0]: https://coverage.readthedocs.io/en/7.8.0/changes.html#versio...

kinow 127 days ago [-]

Ah, thank you! I think you just saved me some time!

anticodon 127 days ago [-]

I profiled a huge legacy test suite using cProfile and found a lot of low-hanging fruit. For example, some tests were creating a 4000x3000 Pillow image in memory just to test how the image-saving code works (checking that the filename and extension are correct). Hundreds of tests created this huge image for every test (in the setUp method) because of unittest's reliance on inheritance. Reducing the image size to 10x5 made the test suite about 5-7% faster (it was a long time ago, so I don't remember the exact numbers).

So, I'd run the tests under cProfile first.
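
A small stdlib sketch of that workflow: profile one callable and print the most expensive entries by cumulative time. The names `profile_call` and `slow_setup` are made up for the example.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10):
    """Run `func` under cProfile; return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so expensive setup code floats to the top.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def slow_setup(n):
    """Stand-in for an expensive setUp method."""
    return sum(i * i for i in range(n))
```

For a whole suite, `python -m cProfile -o out.prof -m pytest` writes the same kind of data to a file that `pstats` can load afterwards.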

dmurray 127 days ago [-]

But the changes in TFA were of the order of a 75% improvement for "dumb" changes that were agnostic to the details of the tests being run.

Saying you got a 5-7% improvement from a single change, discovered using the profiler, that took understanding of the test suite and the domain to establish it was OK, and that actually changed the functionality under test - that's all an argument for doing exactly the opposite of what you recommend.

anticodon 127 days ago [-]

> that actually changed the functionality under test - that's all an argument for doing exactly the opposite of what you recommend.

It was old functionality. Someone wrote a superclass that created extremely large images in order to test filesystem functionality. Not only was there no need to test with such large images, but other developers eventually inherited more test cases from that setup code (because it had other utility methods), so the setUp code was needlessly creating images that no test used.

Generating a huge 4k image takes a significant time using Pillow.

bgwalter 127 days ago [-]

I get that pytest has features that unittest does not, but how is scanning for test files in a directory considered appropriate for what is called a high security application in the article?

For high security applications the test suite should be boring and straightforward. pytest is full of magic, which makes it so slow.

Python in general has become so complex, informally specified and bug ridden that it only survives because of AI while silencing critics in their bubble.

The complexity includes PSF development processes, which lead to:

https://www.schneier.com/blog/archives/2024/08/leaked-github...

williamdclt 127 days ago [-]

> it only survives because of AI

I don't disagree that it's "complex, informally specified" (idk about bug ridden or silencing critics), but it's just silly to say it only survives because of AI. It was a top-used language before AI got big for web development, data science and all sorts of scientific analysis, and these haven't gone away: I don't expect Python lost much ground in these fields, if any.

bgwalter 127 days ago [-]

Dropbox moved parts from Python to Golang already in 2014. Google fired the Python team last year and I hear that it does not use Python for new code. Instagram is kept afloat by gigantic hacks.

The scientific ecosystem was always there, but relied on heavy marketing to academics, who (sadly) in turn indoctrinate new students to use Python as a first language.

I did forget about sysadmin use cases in Linux distributions, but they could be easily replaced by even Perl, as leaner BSD distributions already do.

guappa 127 days ago [-]

You'd be right if go wasn't an awful language designed by someone who clearly failed their compiler class at university.

westurner 127 days ago [-]

strace is one way to determine how many stat calls a process makes.

Developers avoid refactoring costs by using dependency inversion, fixtures and functional test assertions without OO in the tests, too.

Pytest collection could be made faster with ripgrep and does it even need AST? A thread here mentions how it's possible to prepare a list of .py test files containing functions that start with "test_" to pass to the `pytest -k` option; for example with ripgrep.

One day I did too much work refactoring tests to minimize maintenance burden and wrote myself a functional test runner that captures AssertionErrors and outputs with stdlib only.

It's possible to use unittest.TestCase() assertion methods functionally:

  assert 0 == 1
  # AssertionError

  import unittest
  test = unittest.TestCase()

  test.assertEqual(0, 1)
  # AssertionError: 0 != 1

unittest.TestCase assertion methods have default error messages, but the `assert` keyword does not.

In order to support one file stdlib-only modules, I have mocked pytest.mark.parametrize a number of times.

chmp/ipytest is one way to transform `assert a == b` to `assertEqual(a,b)` like Pytest in Jupyter notebooks.

Python continues to top language use and popularity benchmarks.

Python is not a formally specified language, mostly does not have constant time operations (or documented complexity in docstring attrs), has a stackless variant, supported asynchronous coroutines natively before C++, now has some tail-call optimization in 3.14, now has nogil mode, and is GPU accelerated in many different ways.

How best could they scan for API tokens committed to public repos?

woodruffw 127 days ago [-]

pytest's magic is not itself a significant overhead factor. All test suite systems need to perform a similar type of collection; unittest does the exact same thing via `unittest.main()`.

zahlman 127 days ago [-]

Critics of Python don't get "silenced in their bubble" generally, just ignored.

Critics of the PSF, well, that's another story.

As for complexity, it's not so much that new features are added, but that people are using Python in larger systems, and demanding things to help manage the complexity (that end up adding more complexity of their own). The Zen of Python is forgotten - and that's largely on the users.

pytest is full of magic, but at least it uses that magic to present a pleasant UI. Certainly better than unittest's JUnit-inspired design. But it'd be that much nicer to have something that gets there directly rather than wrapping the bad stuff, and which honours "simple is better than complex" and "explicit is better than implicit" (test discovery, but also fixtures).

bgwalter 127 days ago [-]

> Critics of Python don't get "silenced in their bubble" generally, just ignored.

I disagree. The public bans are just the tip of the iceberg. Here is a relatively undocumented one:

https://lwn.net/Articles/1003436/

It is typical for a variety of reasons. Someone complains about breakage and is banned. Later, when the right people complain about the same issue, the breakage is reverted.

The same pattern happens over and over. The SC and the PSF are irresponsible, incompetent and malicious.

selfselfgo 127 days ago [-]

[dead]
