Web scraping with Python + Selenium + multiprocessing in a Docker container becomes slower over time
I have built a web crawling solution with Python, Selenium, and multiprocessing, deployed in a Docker container on an EC2 instance (m4.2xlarge). Whenever I run it with a large input, it uses the specified number of CPU threads at first, for roughly the first 1000 URLs. After that it starts using fewer threads, and crawling becomes very slow as a result.
I am looking for a way to debug this and understand why the program uses fewer CPU threads/cores over time.
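One way to start debugging this, offered as a minimal sketch rather than a known fix: have every worker report its PID and how long each URL took, then summarize the results in windows of completed tasks. The names `timed` and `summarize` are hypothetical helpers, not part of Selenium or multiprocessing, and `task` stands for whatever function crawls a single URL.

```python
import os
import time


def timed(task, url):
    """Run task(url) and return (pid, url, seconds) for the parent to log."""
    start = time.monotonic()
    task(url)
    return os.getpid(), url, time.monotonic() - start


def summarize(records, window=100):
    """Per window of completed tasks: (offset, distinct worker PIDs, mean seconds)."""
    rows = []
    for i in range(0, len(records), window):
        chunk = records[i:i + window]
        pids = {pid for pid, _, _ in chunk}          # workers that did any work
        mean = sum(sec for _, _, sec in chunk) / len(chunk)
        rows.append((i, len(pids), mean))
    return rows
```

If the records are collected in completion order (for example from `pool.imap_unordered(functools.partial(timed, crawl_one), urls)`, where `crawl_one` is your own top-level function), a falling PID count around the 1000th URL would suggest that workers are hanging or dying, while a stable PID count with rising mean duration would suggest each task is simply getting slower. Comparing this against `docker stats` output for the container may help tell the two apart.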
Top Answer/Comment:
Answer based on "logic":
Possible fix: set a limit on the maximum number of threads?
Reason:
Task/process/thread switching is likely to stop being efficient beyond some (undefined) level of concurrency.
From the above I conclude that it is peaking at 1000+ threads on your VM/OS/hardware combination; others might see the limit at a different level, depending on the actual situation.
The OS is already running a good number of process instances before you start yours.
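The suggestion to cap concurrency could look roughly like the sketch below. Since the question uses multiprocessing, it caps worker processes rather than threads. It assumes the crawl can be expressed as one function call per URL; `worker_count`, `fetch`, and `crawl` are hypothetical names, and `fetch` is only a stand-in for the real Selenium work. `processes` and `maxtasksperchild` are standard `multiprocessing.Pool` parameters.

```python
import multiprocessing as mp
import os


def worker_count(cpu_total, reserve=1, cap=None):
    """Leave `reserve` cores for the OS and never exceed `cap` workers."""
    n = max(1, cpu_total - reserve)
    return n if cap is None else max(1, min(n, cap))


def fetch(url):
    # Hypothetical stand-in for the real Selenium crawl of one URL.
    return url, os.getpid()


def crawl(urls, workers, tasks_per_worker=50):
    # A fixed-size pool caps concurrency; maxtasksperchild replaces each
    # worker process with a fresh one after it has handled that many tasks.
    with mp.Pool(processes=workers, maxtasksperchild=tasks_per_worker) as pool:
        return list(pool.imap_unordered(fetch, urls))
```

On an m4.2xlarge, which has 8 vCPUs, `worker_count(8)` gives 7 workers, leaving one core for the processes the OS is already running. Call `crawl(urls, worker_count(os.cpu_count() or 1))` from under an `if __name__ == "__main__":` guard. Whether recycling workers helps in this case is an assumption to test, not something the answer establishes.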