
Lint multiple files in parallel [$500] #3565

Open
ilyavolodin opened this issue Aug 28, 2015 · 156 comments
Labels: accepted, bounty, cli, feature, needs bikeshedding, needs design

@ilyavolodin (Member)

This is a discussion issue for adding the ability to run ESLint in parallel on multiple files.

The idea is that ESLint is mostly CPU bound, not IO bound, so creating multiple threads (on machines with multiple cores) might (and probably will) increase performance in a meaningful way. The downside is that ESLint's codebase is currently synchronous, so this would require rewriting everything up to and including eslint.js to be asynchronous, which would be a major effort.

I played with this a little while ago and found a few Node libraries that handle thread pools, including detection of the number of cores available on the machine.

  • Node-threads-a-gogo - seems pretty good, but looks dead.
  • nPool - seems actively in development, but has native components (C++).
  • Node WebWorkers - seems pretty dead too.
  • Parallel - seems dead, and has no pool implementation.
  • Node Clusters - not stable yet, and probably isn't going to be available on Node v0.10.
  • WebWorkers - seem to be implemented only in io.js.

And there are a ton of other libraries out there for this.

If anyone has experience writing multithreaded applications for Node.js and would like to suggest alternatives or comment on the above list, please feel free.

P.S. https://www.airpair.com/javascript/posts/which-async-javascript-libraries-should-i-use

Want to back this issue? Post a bounty on it! We accept bounties via Bountysource.

@ilyavolodin ilyavolodin added the needs bikeshedding Minor details about this change need to be discussed label Aug 28, 2015
@ilyavolodin ilyavolodin self-assigned this Aug 28, 2015
@ilyavolodin ilyavolodin added this to the v2.0.0 milestone Aug 28, 2015
@nzakas nzakas mentioned this issue Aug 28, 2015
@BYK (Member) commented Aug 28, 2015

Nice. I'm very interested in trying this myself too.

@ilyavolodin (Member, Author)

Another question that I had in mind: if we need to rewrite everything to be async, should we use the callback pattern or promises? If promises, which library: Q or Bluebird? I personally would prefer promises to callback hell.

@IanVS (Member) commented Aug 28, 2015

I vote for promises. Bluebird is fast, but makes me nervous because it adds methods to the native Promise, which is likely not a great idea. I think Q might be the best bet. There is also Asequence, but I have no personal experience with it.

@gyandeeps (Member)

Why not use built-in promises? Just a question, as I have no experience with promises yet.

@IanVS (Member) commented Aug 28, 2015

They're not supported in node 0.10, to my knowledge.

Besides that, the libraries give some nice "sugar" methods when working with Promises.

@btmills (Member) commented Aug 28, 2015

I've had plenty of success using native promises (or a polyfill when native promises aren't supported). That seems like a good starting point to me; if we need more than they provide we could probably swap out something that's API-compatible.

@nzakas (Member) commented Aug 28, 2015

I think we're putting the cart before the horse here. Let's hold off on promises vs. callbacks until we're at least ready to prototype. Get something working with callbacks and let's see how bad it is (or not).

@nzakas nzakas added cli Relates to ESLint's command-line interface accepted There is consensus among the team that this change meets the criteria for inclusion feature This change adds a new feature to ESLint labels Aug 28, 2015
@lo1tuma (Member) commented Aug 28, 2015

The idea is that ESLint is mostly CPU bound, not IO bound

ESLint also does a lot of IO (directory traversal, reading source files), so I think we would also profit here if we rewrote ESLint to do non-blocking IO.

@ilyavolodin (Member, Author)

@lo1tuma I haven't profiled it yet, but in my mind the amount of IO we do is negligible compared to the amount of CPU cycles we consume. I will try to profile it and post results here if I get anything meaningful.

@pnstickne

Using something like Node Clusters - or most other per-process implementations - would avoid the need to [vastly] rewrite ESLint. (Such implementations are strictly speaking not threading, but they allow managed parallel process execution.)

It would mostly just need to pipe the current ESLint's input and output over IPC; ESLint parallelism would then be able to work freely over different files in a per-process manner, but it could not (without more rework) run concurrently over a single file.

Thus, if the goal is to run ESLint over different files in parallel, I would urge such a 'simple' per-process concurrency approach. If the goal is to make ESLint parallel across the same source/AST, then that is a more complicated can of worms, as it changes the 'divergence' point.

If there is a strict Node v0.10 target for ESLint, maybe enable this as a feature only when running a compatible Node version.

@mysticatea (Member)

My idea is:

  • The master process does queue control and collects results.
  • Worker processes have a short queue (length about 2-4). A worker:
    • Sends the result of the last file and requests the path of the next-next-next file from the master.
    • Asynchronously reads the next-next file.
    • Lints the next file.

Source files come in a wide variety of sizes, so I think pull-style queue control is important. A minimal sketch of this flow follows below.

Library... I think child_process.fork is enough.
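A minimal sketch of this pull-style flow, assuming child_process.fork and today's Linter API (worker.js, the empty lint config, and the JSON report are placeholders, not anything specified in this thread):

// master.js (hypothetical) - hand out one path at a time; an idle worker pulls the next file
const { fork } = require("child_process");
const os = require("os");

const files = process.argv.slice(2);
const workerCount = Math.min(os.cpus().length, files.length);
let next = 0;
let active = workerCount;
const results = [];

for (let i = 0; i < workerCount; i++) {
  const worker = fork("./worker.js");
  worker.on("message", (result) => {
    results.push(result);
    if (next < files.length) {
      worker.send(files[next++]); // pull style: the master hands out the next path on demand
    } else {
      worker.disconnect();
      if (--active === 0) console.log(JSON.stringify(results, null, 2));
    }
  });
  worker.send(files[next++]); // prime each worker with its first file
}

// worker.js (hypothetical) - read asynchronously, lint, send the result back
// const fs = require("fs");
// const { Linter } = require("eslint");
// const linter = new Linter();
// process.on("message", (file) => {
//   fs.readFile(file, "utf8", (err, source) => {
//     process.send({ file, messages: err ? [String(err)] : linter.verify(source, {}) });
//   });
// });

A deeper per-worker queue (the 2-4 items described above) would send several paths at once and keep one read in flight while linting, but pull-style dispatch is the core of the idea.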

@gyandeeps (Member)

Sorry for this question, but based on all the comments, do we think the reward of this functionality is worth the effort/changes/complications? (Like I said, just a question.)
Or is this functionality too early (like Google Glass) to implement without any actual estimates or asks, or whatever.

@pnstickne

@gyandeeps My projects are not big enough, nor my computer slow enough, for me to really care either way.

In cases where there are sizable projects with many files, on computers with several cores and unsaturated I/O, I imagine that it could lead to significantly reduced wall-clock time, approaching the limit of Amdahl's law. I would be less optimistic about this gain with fewer, larger files, even with 'actual threading' or post-AST handling - but that is what performance profiles are for.

Of course, another option is to only lint 'changed' files and provide some sort of result cache, but that comes with additional data management burdens.

@ilyavolodin (Member, Author)

@gyandeeps To answer your question: we do not have definitive information on this. Right now my assumption is that we are CPU-bound, not IO-bound. In that case, utilizing more cores should have a significant impact on larger projects (as @pnstickne mentioned, the impact will be small, or there might even be a negative impact, on a few large files).
I think the first step of the process would be to prove or disprove my assumption on CPU vs IO. Granted, if it turns out I'm wrong and we are IO-bound, then changing ESLint's code to async would improve performance anyway.

@pnstickne Thanks for the insights. I'm really not familiar with Node Clusters; I just know that they exist and do not require everything to be async for them to work. We're definitely not going to try to make ESLint run AST analysis in parallel: that would be a huge change, and a lot of our rules expect nodes to be reported in the right order, so we would not only need to rewrite pretty much the whole core, we would also need to rewrite a lot of rules.
I would be interested in learning more about how we could incorporate code that would only execute on Node 0.12 and not on earlier versions. I'm not sure that's the approach I would like to take, but it's an interesting option nevertheless.

@ilyavolodin (Member, Author)

So I did some very unscientific performance analysis on both Windows and OSX.
I specifically chose to lint a very large number of files (TodoMVC: 1,597 files, 854 folders).
On Windows, results were very inconclusive. Basically, my CPU usage never went over 15% (on an 8-core machine, none of the cores were overtaxed at any point). But then my I/O never hit anything over 10 MB/s on an SSD that's theoretically capable of 550 MB/s. So I have no idea why it didn't run any faster.
On OSX it never hit more than 15% CPU (I have no idea how many cores my MacBook has, probably 8 too), but I/O was pretty taxed. It looked like there were a few spikes that were reaching 100% disk throughput. So maybe my assumption was wrong and we are not CPU bound?

@pnstickne

@ilyavolodin Try this:

Start running 8 different eslint processes (over the same set of files should be fine, although it would be 'more fair' to break up the files into 8 different equally-sized sets).

Compare the wall-clock times it takes for 1 process to complete and 8 processes to complete.

The 8 processes will have done 8x the work (if using the same set of files for each process as for the single process) or the same amount of work (if having split the source files among them), but in how many multiples of the time?

This should, very crudely, show an approximate gain - if any - from using multi-process concurrency. A throwaway script for this comparison is sketched below.
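Something along these lines could time both runs (nothing here is from the thread; an eslint binary on PATH is assumed, and exit codes and lint output are deliberately ignored):

// bench.js - crude wall-clock comparison of 1 vs 8 concurrent eslint processes
const { execFile } = require("child_process");

function run(n, files) {
  const started = Date.now();
  return new Promise((resolve) => {
    let done = 0;
    for (let i = 0; i < n; i++) {
      execFile("eslint", files, { maxBuffer: 64 * 1024 * 1024 }, () => {
        if (++done === n) resolve(Date.now() - started);
      });
    }
  });
}

(async () => {
  const files = process.argv.slice(2);
  console.log(`1 process:   ${await run(1, files)} ms`);
  // 8x the work here, since every process lints the same file set;
  // the interesting number is the ratio, not the absolute time.
  console.log(`8 processes: ${await run(8, files)} ms`);
})();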

@platinumazure (Member)

Late to the conversation, but... Is anyone opposed to someone starting up a pull request to implement the callback hell approach (i.e., make ESLint async across the board, including for all I/O operations)? Seems like it would make the eventual parallelization easier anyway.

@IanVS (Member) commented Aug 30, 2015

Of course another option is to only lint 'changed' files and provide some sort of result cache, but comes with additional data management burdens.
-@pnstickne

This is essentially what ESLinter from @royriojas does.

@pnstickne

@IanVS That's pretty cool... now if only it were built into my eslint grunt task :}

(Okay, I could get it shimmed in pretty easily - but it'd still be nice to see a 'done package'.)

@ilyavolodin (Member, Author)

@pnstickne #2998
@platinumazure I think I would like to first prove that there's something to gain by going async/multithreaded, just so nobody wastes their time creating a very large PR that will then be closed because there is no gain.

@platinumazure (Member)

@ilyavolodin That's fair enough. I'm wondering, though, would it be worth creating a separate issue with the goal of making ESLint's I/O operations asynchronous? Synchronous I/O is a bit of an anti-pattern in Node and it might help us figure out just how much of a breaking change parallelization would likely be.

@ilyavolodin (Member, Author)

@platinumazure While sync code is a Node anti-pattern, in our case we can't do anything with the code (as in parse it into an AST) until we've read the whole file. So if improving performance is off the table, changing the code to async would increase its complexity without gaining us anything. It's worth testing out, and I still think there's a performance gain we can get from parallel execution, but I would like to get some proof of that first.

@lo1tuma (Member) commented Aug 30, 2015

@ilyavolodin reading files asynchronously doesn't mean you read them chunk-by-chunk. While you are waiting for one file to be read, you can lint a different file which has already been read.
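A small sketch of that overlap, using today's fs/promises and Linter APIs rather than anything that existed at the time (the empty lint config is a placeholder):

// Keep one read in flight while linting the previously read file.
const fs = require("fs/promises");
const { Linter } = require("eslint");

async function lintAll(paths) {
  const linter = new Linter();
  const reports = [];
  let pending = paths.length ? fs.readFile(paths[0], "utf8") : null;
  for (let i = 0; i < paths.length; i++) {
    const source = await pending; // wait for the current file's contents
    // start reading the next file before we burn CPU linting this one
    pending = i + 1 < paths.length ? fs.readFile(paths[i + 1], "utf8") : null;
    reports.push({ file: paths[i], messages: linter.verify(source, {}) });
  }
  return reports;
}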

@paulbrimicombe

> @paulbrimicombe @nzakas Does this mean this feature is delayed until the next major version is out?

@borisyordanov Unfortunately I'm not a maintainer on this project, so I can't give you any details about the release schedule; I just raised a PR that addressed this issue. It looks like this work is blocked on approval of the RFC plus some in-flight refactoring work that @nzakas mentioned above. In its current state, the RFC requires breaking changes (which I'm not convinced is at all necessary; certainly my PR doesn't have any breaking changes in it). Please comment on the RFC if you'd like to.

@nzakas (Member) commented Jan 28, 2022

@borisyordanov there is no timeline for parallel linting right now. We have a bunch of other core work that must be completed first. When that core work is done, we can evaluate where we are in terms of parallel linting.

@jinlong commented Mar 22, 2022

> I've just tried with jest-worker and it runs well

@leoswing How did you use jest-worker with ESLint? Did you just write a script to execute the eslint command? Thanks.

> I've created a repo which supports ESLint v7+, and it tests well with fast performance. I will try to update the codebase from my workspace to GitHub these days. @jaysonwu991

Hi @leoswing, may I learn from your code example? I have the same problem that linting multiple files is very slow. Thanks a lot.

@gajus (Contributor) commented Mar 22, 2022

FYI, you can achieve this using https://github.com/jest-community/jest-runner-eslint

Our ESLint time has gone down from 13 minutes to 2 minutes

@paulbrimicombe

When I was testing my parallel eslint implementation (now sadly binned because of the ongoing config refactor) I found that enabling caching made a huge difference (see https://eslint.org/docs/user-guide/command-line-interface for details on the CLI flags).

@gajus (Contributor) commented Mar 22, 2022

You cannot / should not use cache in production CI. Otherwise, yeah, it helps a lot.

I would still encourage adopting Jest runner if you are at a point where this is a bottleneck. Chances are, you have other bottlenecks that can be helped with this (like tsc).

@FlorianWendelborn

@gajus can you elaborate on why you think you shouldn’t use cache in CI?

@gajus (Contributor) commented Mar 22, 2022

I would need to go back to reading the ESLint documentation, but I am pretty sure that its cache does not use a hash to identify whether files have changed, but rather metadata (presumably last-modified date). I would not trust this information to carry over through git accurately.

@FlorianWendelborn

@gajus you can configure that:

--cache-strategy String Strategy to use for detecting changed files in the cache - either: metadata or content - default: metadata
https://eslint.org/docs/user-guide/command-line-interface
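For instance, the content strategy can be set either on the command line or through the Node API. A small sketch (the cacheStrategy constructor option should exist in ESLint versions that support the CLI flag, but check your version's docs):

// CLI equivalent:  eslint --cache --cache-strategy content .
const { ESLint } = require("eslint");

const eslint = new ESLint({
  cache: true,
  cacheStrategy: "content", // hash file contents instead of relying on mtime metadata
});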

@gajus (Contributor) commented Mar 22, 2022

Great. Use it if that fits your bill. In our case, parallelization was preferred. We keep CI jobs stateless.

@bmish (Member) commented Mar 23, 2022

Regarding caching: I implemented ESLint caching on CI for a large repository at one point but scrapped it because it can result in lint violations getting through to master/main in rare situations:

It is technically possible that a change in one file can cause a linting violation in another unchanged file. For example, eslint-plugin-import does things like validate that imported modules are actually exported from the file they’re being imported from. Unfortunately, I believe this defeats any attempt at avoiding a full eslint run.

Note that aside from this edge case, caching on CI can work successfully most of the time, as long as you use:

  1. eslint --cache --cache-strategy content
  2. A cache file (for --cache-location) with a name that includes a hash of anything that can affect your linting:

For the record, here's the imperfect eslint CI caching script I put together. I don't really recommend anyone tries to use it given the limitations but am including it for context.

imperfect eslint CI caching script
#!/usr/bin/env bash

EXIT_STATUS=0

if [[ "$ENVIRONMENT" == "CI" ]]; then
  # Use kochiku's shared cache to cache the results of running ESLint between builds.
  # Depending on the number of files changed, caching can reduce lint running time from 10+ minutes to just seconds.

  # Generate a hash based on:
  # * The installed eslint package versions
  # * The code in eslint-plugin-dashboard (since this package is not versioned)
  # We use this to invalidate the cache if any lint packages are updated since
  # eslint caching itself does not take this into consideration: https://github.com/eslint/eslint/issues/12578
  HASH_ESLINT_PACKAGES="$(yarn list --pattern '*eslint*' --depth=0 | grep eslint | md5sum | awk '{print $1}')"
  HASH_ESLINT_PLUGIN_DASHBOARD="$(find ../packages/eslint-plugin-dashboard/lib -type f -name "*.js" -exec md5sum {} + | awk '{print $1}' | sort | md5sum | awk '{print $1}')"
  HASH="$HASH_ESLINT_PACKAGES-$HASH_ESLINT_PLUGIN_DASHBOARD"
  echo "ESLINT CACHING: Generated hash of eslint packages: $HASH"

  PATH_CACHE_ESLINT_DIR="/mnt/nfs/shared-cache/dashboard/eslint/"
  PATH_CACHE_ESLINT_FILE="$PATH_CACHE_ESLINT_DIR/$HASH"
  PATH_CACHE_ESLINT_TEMP_FILE="$(mktemp -u)"
  if [ "$GIT_BRANCH" == "master" ]; then
    # Every master build performs a full eslint run and caches the result for non-master builds to use.
    # Note: we use `--cache-strategy content` because the default caching strategy of mtime is useless to us (git does not maintain file mtime, so mtime will always be the time when the current build cloned the repository).
    yarn eslint --cache --cache-strategy content --cache-location "$PATH_CACHE_ESLINT_TEMP_FILE" . "$@" || EXIT_STATUS=1
    rm -rf $PATH_CACHE_ESLINT_DIR # Delete any old cache files.
    mkdir $PATH_CACHE_ESLINT_DIR
    cp $PATH_CACHE_ESLINT_TEMP_FILE $PATH_CACHE_ESLINT_FILE
    echo "ESLINT CACHING: Generated new cache."
  else
    # Non-master builds use the cache generated by the last master build.
    # Note: a temporary file is used to avoid overwriting the master cache.
    if [ -f "$PATH_CACHE_ESLINT_FILE" ]; then
      echo "ESLINT CACHING: Found cache to use."
      cp $PATH_CACHE_ESLINT_FILE $PATH_CACHE_ESLINT_TEMP_FILE
    else
      echo "ESLINT CACHING: No cache found."
    fi
    # Note: we use `--cache-strategy content` because the default caching strategy of mtime is useless to us (git does not maintain file mtime, so mtime will always be the time when the current build cloned the repository).
    yarn eslint --cache --cache-strategy content --cache-location "$PATH_CACHE_ESLINT_TEMP_FILE" . "$@" || EXIT_STATUS=1
  fi
  rm $PATH_CACHE_ESLINT_TEMP_FILE
else
  yarn eslint --cache . "$@" || EXIT_STATUS=1
fi

exit $EXIT_STATUS

@leoswing commented Nov 1, 2022

I've implemented an ESLint API with multiple worker thread support.

@sam3k (Contributor) commented Feb 23, 2023

Here is the TL;DR on the status of this feature as of February 23, 2023:

  • "There is no timeline for parallel linting right now. We have a bunch of other core work that must be completed first. When that core work is done, we can evaluate where we are in terms of parallel linting." -- @nzakas
  • Some of the prerequisites have been completed, so we are moving in the right direction.
  • There is still a $500 bounty for this.
  • Official RFC
    • The RFC has been closed. Per @nzakas: "With all of the internal changes, we are going to need to rethink how to move forward with this, so closing." (Please see the TSC summary below for more info.)

TSC Summary

ESLint assumes that each rule and source file can be processed independently. typescript-eslint (ref #42 (comment)) and eslint-plugin-import (ref #42 (comment)) need to do upfront initialization work beyond the scope of a single rule and source file, specifically loading type information and tracing a module graph. Lacking a first-class API, they have inserted these initialization steps into the regular rule linting flow.

If we were to ship parallel linting without supporting this use case, the duplicated initialization could make parallel linting slower than single-threaded linting with these plugins. The large number of ESLint users who also use one of these plugins would not benefit from parallel linting.

This API would need to provide the plugin with the config and list of files to be linted, and parallel linting would need a way to share the result with workers.
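To make the gap concrete, here is a purely hypothetical sketch of what such a first-class initialization hook might look like. No such API exists in ESLint today; the hook name, signature, and buildModuleGraph helper are all invented for illustration:

// HYPOTHETICAL plugin shape - ESLint has no such hook today.
module.exports = {
  rules: { /* ... */ },
  // Invented hook: called once in the main process, before files are
  // distributed to workers, with the resolved config and the full file list.
  async initialize({ config, files }) {
    // e.g. eslint-plugin-import could trace the module graph up front,
    // typescript-eslint could load type information per project.
    const sharedState = await buildModuleGraph(files); // invented helper
    // Whatever is returned would need to be serializable (or otherwise
    // shareable) so the linter could hand it to every worker.
    return sharedState;
  },
};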

Prerequisites

One of the main reasons why we cannot proceed with this issue is that a lot of work needs to be done to the core APIs in order to even move forward. Some of the things discussed:

  • Moving the cache helper utils into their own module. PR
  • CLIEngine was deprecated in ESLint v8 in favor of the ESLint class
  • The ESLint v7 public API (provides async methods which will allow us to do parallelism)
  • FlatConfig needs to be completed

Alternative Solutions in the Meantime

jest-runner-eslint

Consider looking at jest-runner-eslint. Some folks have seen considerable performance improvements:

Our ESLint time has gone down from 13 minutes to 2 minutes

Esprint

Esprint could be another alternative in the meantime.


Please let me know if I've missed anything and I'll update this comment.

@nzakas (Member) commented Feb 27, 2023

Awesome work @sam3k!

We are leaving this open because we do still intend to work towards this.

@Rugvip commented Mar 3, 2023

Just wanna chime in that we've implemented a solution based on running ESLint in worker threads in the Backstage CLI. It provided a massive speed increase as well, especially compared to running lint tasks through Lerna.

@nzakas (Member) commented Mar 7, 2023

@Rugvip nice! Thanks for sharing.

@Yankovsky

> FYI, you can achieve this using https://github.com/jest-community/jest-runner-eslint
>
> Our ESLint time has gone down from 13 minutes to 2 minutes

Can you provide an example config? I can't make jest-runner-eslint work for some reason.

@stevenpetryk commented Jun 9, 2023

A lot of discussion about parallel linting is taking place in #14139, because it will affect plugins that have a significant initialization step. Some salient points from that conversation:

  • Plugins would ideally have the ability to determine how files are split up across workers (for example, typescript-eslint would benefit from being able to shard by TypeScript project so that it's more likely that a single shard would entirely encapsulate one or more TypeScript projects).
  • Plugins would need some kind of shared state that they could establish in an initialization step.
  • Parallelization is likely not worth pursuing until these problems can be solved in a way that supports eslint-plugin-import and typescript-eslint, which are used by 60-70% of ESLint users already.

@nzakas (Member) commented Jun 13, 2023

@stevenpetryk thanks for the summary! Very helpful to have all of this info in one place.

@faultyserver commented Dec 2, 2023

Wanted to chime in here given the hype about speeding up tools.

tl;dr: ESLint can run in parallel with no issues, and the overhead of duplicated work is negligible: main...discord:eslint:parallel. This adds an opt-in --parallel flag to distribute work across multiple cores, but only if the user decides it's helpful.


We've been running ESLint on our monorepo in CI for a long time. As the codebase grew, ESLint started taking longer and longer to run. A few years ago, we had about 10k files, and ESLint would take ~3 minutes to run on CI. Then we wanted to add TypeScript rules to do some additional validation, but unfortunately adding those rules kicked up our ESLint run time to almost 8 minutes. Eventually, we split out the TS rules from all of the other ones and ran them as two separate jobs, which actually worked pretty well. Normal ESLint was back down to about 3 minutes, and the TypeScript job also took about 3 or 4 minutes to run just those rules.

Fast-forward to about 2 months ago: our codebase has continued to expand, to almost 15k files and well over half a million LOC. Normal ESLint now takes 5 minutes to run, and the TypeScript rules are getting close to 6. It's pretty infeasible to run them locally at this point, and they're becoming the bottleneck of our CI pipeline. I wanted to do something about it, and toyed with splitting up the ESLint jobs even more, but none of that worked too well. Most importantly, reporting on errors became a big hassle.

After some quick investigation, it became pretty clear that the time-consuming part of ESLint isn't the rules... at all. We run TypeScript rules, eslint-plugin-import, and a number of in-house rules that are all fairly expensive to check, but when I looked at the runtime of things, it didn't add up. The real kicker is just the number of files and the size of the code being processed. Iterating over 15k files and traversing their ASTs just takes a really long time. Period. The actual runtime of all of our rules was only about 12% of the total runtime of ESLint; the rest was spent just doing that iteration.

Taking inspiration from @mixer/parallel-prettier, I decided to just implement a worker model that sends batches of files to worker processes and then collects the results into a single report. Getting that to work with the CLI meant forking ESLint and building directly inside of the codebase, since the internals aren't directly exported for modern versions of Node, but after some initial setup and exploration, I ended up with something that seemed to work, gave all the right results, and printed them out in a single report at the end: main...discord:eslint:parallel. The best part of this is that there's no worry about incorrect configuration or missing files, because it's using all of the same ESLint internals, just dispatching the work to be done across multiple cores.
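The linked branch is the real implementation; purely as an illustration of the shape of this worker model (the file names, round-robin batching, and worker module here are invented, not taken from the fork), it amounts to something like:

// parallel-lint.js - illustrative only; the real code lives in the branch above
const { fork } = require("child_process");
const os = require("os");

const files = process.argv.slice(2);
const workers = Math.max(1, os.cpus().length - 1);

// Deal files round-robin into one batch per worker, dropping empty batches.
const batches = Array.from({ length: workers }, () => []);
files.forEach((file, i) => batches[i % workers].push(file));
const work = batches.filter((batch) => batch.length > 0);

let remaining = work.length;
const report = [];

for (const batch of work) {
  const child = fork("./lint-worker.js"); // hypothetical worker module, sketched below
  child.send(batch);
  child.on("message", (results) => {
    report.push(...results); // collect every worker's results into one report
    child.disconnect();
    if (--remaining === 0) console.log(JSON.stringify(report));
  });
}

// lint-worker.js (hypothetical):
//   const { ESLint } = require("eslint");
//   process.on("message", async (batch) => {
//     const eslint = new ESLint();
//     process.send(await eslint.lintFiles(batch));
//   });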

Then it was time to see just how much things improved.

Running locally on my M2 Max with 12 cores, the normal ESLint run would take about 95 seconds to run (the 3-5 minute numbers from earlier were on CI, not our local machines). With parallelization across all of those cores (11 workers), the total run time was reduced to...20 seconds. A huge return, more than a 75% speedup. Then I wanted to see how TypeScript was affected. As mentioned in this thread, the TS rules do a lot of work scanning the entire project, and it uses a lot of caching that would be nice to have persistent across everything, but how much does that affect the run? Well before, the job with just those rules would take about 80 seconds to run. In parallel...24 seconds. Basically the same speedup, despite not having as good of a cache to rely on.

Why? Because it really is just outweighed by the number of files being operated on, and the fact is that TS is itself a single-threaded workload. So all of that scanning work also ends up being run in parallel, and the only actual overhead is the slightly higher amount of time spent computing things that might have been cached already if they were all run on a single core. But even then, all of that work has to be done at some point on a single thread anyway.

Our CI times, for reference, have a very similar level of improvement. We use 6 core machines, and the runtime of normal ESLint has dropped from 4 minutes to just over 1 minute, and TS rules from 4 minutes down to just under 2 minutes. Definitely some relative overhead, but still a massive improvement.


All of this story is to say: I think the debate about how to design a good API for making parallel ESLint truly efficient is absolutely worthwhile. A way to cache work across multiple workers would be great, as would having some input on how files get batched to make it more efficient. But none of that is necessary to see huge, safe improvements in runtime just by distributing the work naively.

I'd love to upstream these changes into ESLint itself, both so that everyone else can benefit from them, but also so we don't have to maintain a fork lol. I think it's probably close to being able to be merged, sans a few stylistic things, and of course general review. I'm happy to make a PR to get that process started if it would be welcome.

@timocov commented Dec 3, 2023

Then I wanted to see how TypeScript was affected. As mentioned in this thread, the TS rules do a lot of work scanning the entire project, and it uses a lot of caching that would be nice to have persistent across everything, but how much does that affect the run? Well before, the job with just those rules would take about 80 seconds to run. In parallel...24 seconds. Basically the same speedup, despite not having as good of a cache to rely on.

@faultyserver Does your typescript configuration include type-check rules as well?

@faultyserver

@faultyserver Does your typescript configuration include type-check rules as well?

Yep! The TS job is specifically for the rules that require type checking. We left one or two non-type-checked ones like typescript-eslint/naming-convention in the default job since they run quickly and are more likely to be encountered while people are writing code; the TS job then only runs in CI.

It would be great to be able to run them all locally with the LSP, but it just becomes too slow in our codebase (more than a second of delay between hitting save and the file actually saving, etc).

@fasttime (Member) commented Dec 5, 2023

@faultyserver Thanks so much for sharing this work! It's very interesting to see an update with a functioning prototype in this long-standing discussion.

Since this is such a popular issue, I thought I'd also put my 2 cents in. I have a tool for parallelizing ESLint which I've only been able to test on a few projects so far, so it's probably more experimental than yours. For anyone interested, the repo is here: https://github.com/origin-1/eslint-p (just updated to ESLint v8.55).
It's not a tuned-up version of ESLint, but rather a separate package that wraps the CLI and other components (way too many additions to integrate into ESLint without a design/RFC). An important difference is that it only runs on projects that use the flat config, not the legacy eslintrc config. This makes it somewhat complicated to compare the results against your solution. Also, I'm not using multiple child processes to split the workload, but multiple worker threads.

The results look promising for large codebases, but not so good for smaller ones, especially when typescript-eslint is used. Clearly the overhead involved in setting up and synchronizing the workers can easily exceed the benefits of parallelization. Nonetheless, I was able to lint the ESLint repo (JavaScript only) in just under 8 seconds instead of around 16 seconds on an M1 machine with --concurrency=4 (a 50% speedup).

Maybe from a user perspective it would be best to autotune the process depending on the type of the project, the number of files to be linted, the environment, etc. without asking people to set an option (I like the way you set the concurrency to os.cpus().length - 1 by default). I would also love to hear some feedback from other users.

@faultyserver

Great to know there are other attempts at this happening too!

Some quick feedback from what I've learned in our parallel implementation:

Also, I'm not using multiple child processes to split the workload but multiple worker threads.

The main reason for using processes rather than threads is to reduce contention on the process making syscalls to read the file system. Unix only gives a process so many file-system accesses at a time (citation needed, but I discovered that while doing this implementation), so with all the threads in the same process you're potentially hitting that bottleneck, and the single Node VM also has to wait on those syscalls. Distributing over processes means each VM has less contention. It can also avoid context switching within the VM when changing between threads, which also helped some.

Our implementation also lets each worker resolve the config as it goes, so there's no memory sharing necessary at all, making processes much easier to switch to.

The results look promising for large codebases, but not so good for smaller ones, especially when typescript-eslint is used. Clearly the overhead involved in setting up and synchronizing the workers can easily exceed the benefits of parallelization.

Dispatching each file individually between threads or between processes is going to slow down the parallelized version a lot. That's more time spent in serialization and doing IPC (or at least memory copies). Copying a batch of 50 file paths at once and getting the results back in one go will be a lot faster than doing each one individually (2 messages vs 100). If each message takes 10ms to send, that's 20ms vs a full second spent communicating.

Batching also ensures that you only spin up new workers when they're actually useful. Fewer than 50 files? It won't even try to spin up more than 1 worker. Fewer than 200? Never gonna spawn more than 4. 50 seems to be around the boundary where the time to spawn a process becomes negligible, but that's still an arbitrary number. 100 in a batch could also work well, or 20. Depends a lot, but in reality it doesn't matter much.

But that leads to the biggest thing, which is having a pool of workers rather than spawning a new one every time. A new process will take a lot longer to spin up (granted it's less than you probably think, maybe 150ms at most on an M2), but even threads have an initialization cost. Keeping a pool of workers around and queueing them to receive messages will bring down the overhead a lot as well. Synchronizing is also pretty minimal if you keep track of "how busy" each worker is and dispatch to the least busy one. That's pretty cheap to do and helps keep the load balanced over time.
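A rough sketch of that pooling plus least-busy dispatch, using worker_threads (the lint-worker.js module and its one-reply-per-batch protocol are assumptions, not code from either project):

// pool.js - keep N workers alive and hand each batch to the least-busy one
const { Worker } = require("worker_threads");
const os = require("os");

class LintPool {
  constructor(size = Math.max(1, os.cpus().length - 1)) {
    this.entries = Array.from({ length: size }, () => ({
      worker: new Worker("./lint-worker.js"), // assumed: replies once per posted batch, in order
      busy: 0, // batches currently in flight on this worker
    }));
  }

  run(batch) {
    // cheap load balancing: pick the worker with the fewest batches in flight
    const entry = this.entries.reduce((a, b) => (a.busy <= b.busy ? a : b));
    entry.busy++;
    return new Promise((resolve) => {
      entry.worker.once("message", (results) => {
        entry.busy--;
        resolve(results);
      });
      entry.worker.postMessage(batch);
    });
  }

  close() {
    return Promise.all(this.entries.map((e) => e.worker.terminate()));
  }
}

// usage:
//   const pool = new LintPool();
//   const reports = await Promise.all(batches.map((b) => pool.run(b)));
//   await pool.close();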

Nonetheless I was able to lint the ESLint repo (JavaScript only) in just less than 8 seconds instead of around 16 seconds on an M1 machine with --concurrency=4 (50% speedup).

With pooling and batching, even sticking with threads, I think this could get up to maybe 70% speedup with 4 threads, probably a little higher with processes, but not too much more, just because the overall time is relatively low, so initialization is a little more costly.

Maybe from a user perspective it would be best to autotune the process depending on the type of the project, the number of files to be linted, the environment, etc. without asking people to set an option.

For everything up to medium-large projects (a few hundred files), and with batching, there would likely never be more than 4 or 5 workers going at a time. The only way to really know what's going to be fastest for a given project, though, is to run it a few times and see which configuration gives the best result. It's very dependent on the number of files, but also on the size of those files (and, for TypeScript, the type-level complexity). Keeping parallelism entirely opt-in also ensures that small use cases that are super sensitive to initialization costs stay as fast as possible (i.e., my project with 20 files could lint in 60ms, less than the time needed to spawn a single worker; it's definitely not worth doing it then).

@fasttime (Member) commented Dec 5, 2023

The main reason for using processes rather than threads is to reduce contention on the process making syscalls to read the file system. Unix only gives so many accesses at a time to a process (citation needed, but i discovered that while doing this implementation), all the threads being in the same process means you're potentially hitting that bottleneck, and the single Node VM also has to wait on those syscalls. Distributing over processes means the VM has less contention. It can also avoid context switching within the VM when changing between threads, which also helped some.

It's totally possible that the OS restricts the number of simultaneous I/O accesses for a process; I'm not sure, though, whether that limit would already be hit at a concurrency of 2 or 4. It seems more likely that Node.js itself is applying some sort of synchronization here. And it's entirely possible that switching context across threads in Node.js is slower than switching to a different process.

Dispatching each file individually between threads or between processes is going to slow down the parallelized version a lot. That's more time spent in serialization and doing IPC (or at least memory copies). Copying a batch of 50 file paths at once and getting the results back in one go will be a lot faster than doing each one individually (2 messages vs 100). If each message takes 10ms to send, that's 20ms vs a full second spent communicating.

Batching also ensures that you only spin up new workers when they're actually useful. Fewer than 50 files? It won't even try to spin up more than 1 worker. Fewer than 200? Never gonna spawn more than 4. 50 seems to be around the boundary where the time to spawn a process becomes negligible, but that's still an arbitrary number. 100 in a batch could also work well, or 20. Depends a lot, but in reality it doesn't matter much.

But that leads to the biggest thing, which is having a pool of workers rather than spawning a new one every time. A new process will take a lot longer to spin up (granted it's less than you probably think, maybe 150ms at most on an M2), but even threads have an initialization cost. Keeping a pool of workers around and queueing them to receive messages will bring down the overhead a lot as well. Synchronizing is also pretty minimal if you keep track of "how busy" each worker is and dispatch to the least busy one. That's pretty cheap to do and helps keep the load balanced over time.

Batching is an interesting idea I haven't explored enough. I'll try it and see if it helps! My biggest problem at the time was that there just weren't enough big repos using flat config for me to test with. Pooling is already in place; to be honest, I haven't even tried to do without it, as each thread needs to have its own copy of the config and (in the case of typescript-eslint) of the project metadata. Unless I'm overlooking something, loading a fresh copy of that data for each file in a new worker would be terribly slow.
