web-development, data
If my startup needs data from the web (~1M records), what would be the pros/cons of different methods to scrape those data:
Have you any experience in this?
Could you suggest how to evaluate the money and time costs, especially for the second case (having the IT department do it)? Also, are there any special considerations that your experience suggests we should factor in?
The best way to think about scraping data is that you're attempting to build an interface to a very bad API, one that may vary in its response from one request to another, or require human intervention to bypass its features.
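To make the "very bad API" point concrete, here is a minimal sketch of the kind of wrapper a scraper tends to need: it retries a request and checks that the response actually contains what was asked for. The names `fetch_with_retry`, `fetch`, `is_valid` and `ScrapeError` are my own for illustration and are not from any particular tool; `fetch` stands in for whatever function actually downloads a page.

```python
import time


class ScrapeError(Exception):
    """Raised when no attempt produced a usable response."""


def fetch_with_retry(fetch, is_valid, max_attempts=3, delay_seconds=1.0):
    """Call fetch() until is_valid(result) is true or attempts run out.

    fetch    -- hypothetical zero-argument callable that returns the page body
    is_valid -- callable that returns True if the body holds the expected data
    """
    last_problem = None
    for attempt in range(1, max_attempts + 1):
        try:
            result = fetch()
        except Exception as exc:  # e.g. timeouts, connection resets, blocks
            last_problem = exc
        else:
            if is_valid(result):
                return result
            # The request "worked" but the content is not what we wanted
            # (empty page, captcha, changed layout).
            last_problem = ValueError("response failed validation")
        if attempt < max_attempts:
            time.sleep(delay_seconds * attempt)  # simple linear back-off
    raise ScrapeError(f"gave up after {max_attempts} attempts") from last_problem
```

Cases where human intervention is required (a captcha, for instance) will still end in `ScrapeError`; the sketch only makes those failures visible so they can be counted.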
Per Mozenda, factors that affect the cost of a project include:
Writing custom scripts from scratch is usually a bad idea: if the extraction is simple enough to write a script for, there is very likely already a free product out there that does it.
Given that you haven't really provided much to go on, my suggestion would be to find three existing free products and use each to pull a sample of the data, not all of it. Doing so will give you a much better idea of what pulling the data actually requires. I say this in part because the count of records to be pulled (1 million, in your case) means nothing by itself, in my opinion; for example, the records in question might just be a million RSS feeds, or they might be on a site that actively blocks attempts to pull its data.
My Experience: Hello. I am writing based on my experience with the development, implementation, and automation of in-house tools (including web-scraping methods), as well as with evaluating third-party tools. In my experience there are a lot of factors you need to take into account, but one major consideration is how your current and long-term resources are best invested. Here is a basic PRO and CON list for both methods. I hope this gives you some points to consider.
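As a rough illustration of how a sample run could feed a time estimate, here is a small sketch that extrapolates from a sample to the full job. The function name, its parameters and the assumption that the work scales linearly across workers are all mine; a site that throttles or blocks you will break that assumption, which is exactly what the sample is meant to reveal.

```python
def estimate_full_run(sample_size, sample_failures, sample_seconds,
                      total_records, workers=1):
    """Extrapolate run time and failure count from a sample pull.

    Assumes (optimistically) that throughput scales linearly with the
    number of parallel workers and that the sample is representative.
    """
    if sample_size <= 0 or workers <= 0:
        raise ValueError("sample_size and workers must be positive")

    seconds_per_record = sample_seconds / sample_size
    failure_rate = sample_failures / sample_size
    estimated_hours = seconds_per_record * total_records / workers / 3600

    return {
        "seconds_per_record": seconds_per_record,
        "failure_rate": failure_rate,
        "estimated_hours": estimated_hours,
        "expected_failures": round(failure_rate * total_records),
    }


# Made-up sample numbers, purely for illustration:
# 1,000 records pulled in 500 seconds, 30 of them failed.
estimate = estimate_full_run(
    sample_size=1000, sample_failures=30, sample_seconds=500,
    total_records=1_000_000, workers=4,
)
print(estimate)
```

Multiplying the estimated hours by your hourly infrastructure cost, and the expected failures by the time someone spends handling each one by hand, gives a first rough money figure to compare against a vendor quote.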
IT Development PROs:
IT Development CONs:
Outsourcing PROs:
Outsourcing CONs:
Please consider that for every PRO and CON there is a counterpoint. For example, you may notice that the more rigid third-party platform offers a more focused output, but also limits how much customization or enhancement you can do.
The decision is very much dependent on your team’s strengths and what financial and time investments you are in a position to make.
In the end, if a third party is cheap enough, reliable, and provides the service you need, it is a good idea to go that route initially. Development can always be done later, but investing in in-house development can often become quicksand.
Anecdotes:
Note: In your question it was unclear whether you have considered using a third-party vendor, rather than a tool. For example, I work with a vendor that has pre-written scrape algorithms for the specific sites I require data from. I provide them with the terms used to navigate to the specific data I need, and they provide the GUI, the code, and the processing resources (i.e. servers to run the scrapes and store the data). They then deliver the data to us in a template output file. I am not familiar with Mozenda, but at first glance it looks like it provides just the tool. You may want to consider what you would still need to invest in yourself (servers, computers, bandwidth access, etc.).
All content is licensed under CC BY-SA 3.0.