06/16/2021 14:15 | Category: project

Tags: idea, project, scraping

Web scraping service

Jordan and I had a project idea for an automated web scraping service that would compete with ScraperBox, built on a Docker container system.

Services

We would just need to serve API keys to clients, then:

  • Have them send requests to us with the URL to scrape
  • Premium users can persist data on our end?
  • Return the HTML
  • Maybe have a generalized "sanitization" for known types?
    • Try to parse to JSON as best as we can (cleanly)
    • User-defined JSON parsing fields in the request?
  • Avoids CAPTCHAs using modern techniques? (Premium)
  • Renders JavaScript using Selenium (Premium)
  • Offer request concurrency for more money (Premium)
  • Proxies? (Premium)
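
A rough sketch of how one request could flow through the list above: check the API key, reject premium options for non-premium keys, fetch the page, and pull user-defined fields out of the HTML into JSON-ready data. Everything named here is an assumption for illustration only: the key store, the option names, the `ScrapeRequest` shape, and the idea that fields are selected by element id. The `fetch` argument is a hypothetical callable standing in for whatever actually downloads the page. Only the Python standard library is used.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Hypothetical key store; a real service would look keys up in the database.
API_KEYS = {"free-key-123": "free", "premium-key-456": "premium"}

# Assumed option names for the items marked (Premium) in the list above.
PREMIUM_OPTIONS = {"persist", "avoid_captcha", "render_js", "concurrency", "proxy"}

# HTML elements that never have a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


@dataclass
class ScrapeRequest:
    api_key: str
    url: str
    options: set = field(default_factory=set)   # e.g. {"render_js"}
    fields: dict = field(default_factory=dict)  # output name -> element id


def check_request(req):
    """Return the caller's tier, or raise if the key, URL, or options are not allowed."""
    tier = API_KEYS.get(req.api_key)
    if tier is None:
        raise PermissionError("unknown API key")
    if not req.url.startswith(("http://", "https://")):
        raise ValueError("url must be http(s)")
    blocked = req.options & PREMIUM_OPTIONS
    if tier != "premium" and blocked:
        raise PermissionError(f"premium only: {sorted(blocked)}")
    return tier


class IdTextExtractor(HTMLParser):
    """Collect the text inside elements whose id is one of the wanted ids."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted   # element id -> output field name
        self.current = None    # field currently being captured
        self.depth = 0         # nesting depth inside the captured element
        self.result = {}

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if self.current is not None:
            self.depth += 1
            return
        elem_id = dict(attrs).get("id")
        if elem_id in self.wanted:
            self.current = self.wanted[elem_id]
            self.depth = 1
            self.result[self.current] = ""

    def handle_endtag(self, tag):
        if self.current is not None and tag not in VOID:
            self.depth -= 1
            if self.depth == 0:
                self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.result[self.current] += data


def extract_fields(html, fields):
    """Map each requested output name to the stripped text of its element."""
    parser = IdTextExtractor({elem_id: name for name, elem_id in fields.items()})
    parser.feed(html)
    return {name: text.strip() for name, text in parser.result.items()}


def handle_scrape(req, fetch):
    """Validate the request, fetch the page, and return HTML plus parsed data."""
    tier = check_request(req)
    html = fetch(req.url)
    return {"url": req.url, "tier": tier, "html": html,
            "data": extract_fields(html, req.fields)}
```

Passing `fetch` in keeps the sketch independent of how pages are downloaded, so a plain HTTP client, a Selenium session, or a proxy-backed fetcher could be swapped in per tier.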

Structure

  • Docker
  • Django
  • Celery -> Autoscaling worker containers
  • PostgreSQL -> RDS
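
The hand-off between the web tier and the workers might look roughly like the sketch below. It deliberately does not use Celery or Django: a `ThreadPoolExecutor` stands in for the Celery worker containers and a plain dict stands in for the PostgreSQL results table, so the shape of the flow (record the job, queue it, let a worker store the outcome) can be run with the standard library alone. The function names, the job record layout, and the `fetch` callable are all hypothetical.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the Celery worker pool; max_workers plays the role of the
# autoscaled container count.
workers = ThreadPoolExecutor(max_workers=4)

# Stand-in for the PostgreSQL results table, keyed by job id.
results = {}


def scrape_job(job_id, url, fetch):
    """What a worker would run: fetch the page and store the outcome."""
    try:
        results[job_id] = {"status": "done", "html": fetch(url)}
    except Exception as exc:
        results[job_id] = {"status": "failed", "error": str(exc)}


def enqueue(url, fetch):
    """What the web view would do: record the job, hand it off, return its id."""
    job_id = str(uuid.uuid4())
    results[job_id] = {"status": "queued"}
    future = workers.submit(scrape_job, job_id, url, fetch)
    return job_id, future
```

A client would get the job id back immediately and poll for the stored result; in the real stack the future would be replaced by whatever task handle the queue returns.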