Wednesday, September 26, 2007

Do you consider web scraping a threat to your organization?

Q: Do you consider web scraping a threat to your organization?
We have a client whose physician finder page was being scraped. A competitor was regularly pulling all of the doctors out of it and probably importing them as leads straight into their own CRM database. We found a good solution, but I also found out there are now many software tools and service companies who claim to be able to "collect data from the competition and track their behavior over time". I wondered how widespread this might be and how concerned people are about it in general. Any war stories or thoughts on web scraping? Thank you!

A:
Hi ...,
The Internet is called "public" for a reason. In the long run, no protection (applets, images, scripts, etc.) holds up if the information is on the Internet. A very common principle in information security applies here: security through obscurity is not real security.

The only information you can really protect in an online directory system is the phone numbers and e-mail addresses (by not publishing them). You can proxy them through a web application, but even then it is not very difficult to work out the e-mail addresses if the doctors' names are displayed on the pages (assuming the organization follows a standard naming convention).
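To make the naming-convention point concrete, here is a minimal sketch of how a displayed name maps to a likely address. The "first.last" convention, the domain example.org, and the function name guess_email are all assumptions chosen for illustration; a real organization may use a different pattern.

```python
import re

def guess_email(full_name, domain="example.org"):
    """Derive a likely address from a displayed name.

    Assumes a hypothetical "first.last@domain" convention.
    """
    # Drop a leading title such as "Dr." before splitting the name.
    name = re.sub(r"^(dr|prof)\.?\s+", "", full_name.strip(), flags=re.IGNORECASE)
    # Keep only alphabetic runs, lowercased.
    parts = re.findall(r"[a-z]+", name.lower())
    if len(parts) < 2:
        return None  # not enough information to apply the convention
    # Use the first and last parts; middle names or initials are ignored.
    return f"{parts[0]}.{parts[-1]}@{domain}"

print(guess_email("Dr. Jane Smith"))
```

If a few lines like this are enough to rebuild the addresses, hiding them on the page adds little protection once the names themselves are public.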

I have been working with enterprise security and web development teams for years, and I remember stories from as far back as 1996. Once search engines (which are another kind of scraper) became established, the web/screen scraping industry was legitimized.

Scraping technology is relatively simple (programming 101), and in the long run there is no permanent fix. Yes, you can slow scrapers down (for example, by limiting queries from the same IP address within a 10-minute window), but that is not a solution.
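As an illustration of the "slow them down" idea, here is a minimal sketch of a per-IP sliding-window throttle. The class name IpThrottle, the limit of 30 requests, and the 600-second window are assumptions for the example, not a recommendation; a real deployment would more likely do this in the web server or a front-end proxy.

```python
import time
from collections import defaultdict, deque

class IpThrottle:
    """Allow at most max_requests per IP within window_seconds (assumed values)."""

    def __init__(self, max_requests=30, window_seconds=600):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent allowed requests

    def allow(self, ip, now=None):
        # A caller may pass `now` explicitly (useful in tests); otherwise use a monotonic clock.
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: reject or delay this request
        hits.append(now)
        return True

throttle = IpThrottle(max_requests=2, window_seconds=600)
print(throttle.allow("203.0.113.5", now=0))
```

A scraper that rotates IP addresses or simply waits between requests gets past this, which is why throttling only raises the cost of scraping rather than preventing it.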

I do recommend having very strict data classification and privacy policies at your operation. Identify:
• the classified/sensitive/unclassified etc. data
• personally identifiable information
• business and legal requirements (e.g. compliance)
• internal policies
Then design your Internet-facing content according to those requirements. On public pages, classification is quite straightforward: label all Internet-facing, non-authenticated pages “public/unclassified”.

You can keep your private, high-security competitive data inside the perimeter, and protect your sensitive information with one of the several DRM solutions available.

If you have any questions, I would be happy to elaborate.

cheers,
- yinal
