Web scraping with python + selenium + multiprocessing in a docker container beco...

Question 1

I have built a web crawling solution with Python, Selenium and multiprocessing, deployed in a Docker container on an EC2 instance (m4.2xlarge). Whenever I run it with a large input, it uses the specified number of CPU threads at first, up to roughly 1000 URLs. After that it starts using fewer threads, and crawling becomes very slow as a result.

I am looking for a way to debug this and understand why the program uses fewer CPU threads/cores over time.
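One way to approach this kind of debugging is to have every task report which worker process ran it and how long it took, then compare time windows. Below is a minimal sketch using only the standard library. The names `timed_call` and `summarize` and the record layout are illustrative assumptions, not part of the crawler described above.

```python
import os
import time
from collections import defaultdict


def timed_call(func, arg):
    """Run func(arg) and return a record of which worker ran it and when."""
    start = time.time()
    try:
        func(arg)
        ok = True
    except Exception:
        # Failures are recorded rather than raised so the run keeps going.
        ok = False
    return {"pid": os.getpid(), "start": start, "end": time.time(), "ok": ok}


def summarize(records, window_seconds=60.0):
    """Group records by the window in which each task finished.

    For each window, report the task count, the number of distinct
    worker PIDs seen, and the mean task duration in seconds.
    """
    if not records:
        return []
    t0 = min(r["start"] for r in records)
    buckets = defaultdict(list)
    for r in records:
        buckets[int((r["end"] - t0) // window_seconds)].append(r)
    summary = []
    for index in sorted(buckets):
        group = buckets[index]
        summary.append({
            "window": index,
            "tasks": len(group),
            "workers": len({r["pid"] for r in group}),
            "mean_seconds": sum(r["end"] - r["start"] for r in group) / len(group),
        })
    return summary
```

With a `multiprocessing.Pool`, the records could be collected with something like `pool.imap_unordered(functools.partial(timed_call, crawl_url), urls)`, where `crawl_url` stands in for the existing per-URL function. If `workers` drops from one window to the next, worker processes are probably dying or hanging; if `workers` stays constant but `mean_seconds` climbs, the individual tasks are getting slower instead.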

Question 2

Answer based on "logic":

Possible solution: set a limit on the maximum number of threads.

Reason:
Task/process/thread switching is likely to stop being efficient beyond some (undefined) level of concurrency.

From the above I conclude that your VM/OS/hardware combination reaches that point at around the ~1000 mark you describe; others might see it at a different level, depending on their actual setup.

The OS is already running a good number of process instances before you start yours.
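As a rough illustration of setting such a limit, here is a minimal sketch that caps the pool size before any work starts. Capping at the number of CPUs reported by the OS is my assumption about a sensible ceiling, not a measured optimum, and `bounded_worker_count`, `crawl_all` and `crawl_url` are hypothetical names.

```python
import multiprocessing
import os


def bounded_worker_count(requested, cpu_count=None):
    """Clamp the requested worker count to the range [1, number of CPUs]."""
    if cpu_count is None:
        # os.cpu_count() can return None, so fall back to a single worker.
        cpu_count = os.cpu_count() or 1
    return max(1, min(requested, cpu_count))


def crawl_all(urls, crawl_url, requested_workers):
    """Crawl urls with a capped process pool.

    crawl_url is a hypothetical top-level (picklable) function that
    takes one URL and returns its result.
    """
    workers = bounded_worker_count(requested_workers)
    with multiprocessing.Pool(processes=workers) as pool:
        # All results are collected before the pool is torn down.
        return list(pool.imap_unordered(crawl_url, urls))
```

On an m4.2xlarge (8 vCPUs), `bounded_worker_count(1000)` would give 8. Inside a container the CPU count reported by the OS may differ from the CPU quota the container is actually given, so it is worth passing `cpu_count` explicitly if you know the real limit.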