The number of web apps that need to crawl the web in some form or another is huge, and it's growing every day, so either I am the stupidest person on Earth and can't Google properly or there's no one selling web crawling services.
Folks, someone needs to do this. A metered service (like S3) where customers can query your app for crawling results.
I am going to give you two reasons why I should do this myself.
Reason number one:
It's cheaper and not someone else's core competency. How does FriendFeed index all these webpages? Who cares? They shouldn't be doing this. Writing a good web crawler is hard. They need the *data* when it's *new*.
Reason number two:
I have so many ideas, but I want to focus on prototyping them instead of writing crawlers. It would really help devs around the world if they could just use some API to crawl webpages.
Did I say API? Yes, that’s the point. Someone needs to write a crawler with an API:
POST /api/i=http://www.example.com/file.html
user=name
pass=word
when=00 00,12 * * 1-5
expires=2592000
Yeah, that’s the crontab syntax. “when” would also accept “once” and “onchange”.
“expires” is the number of seconds from now after which this crawl is no longer needed.
This request would return an “id”, to be used later, when the customer is ready to download the webpage from us.
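To make that concrete, here's a minimal sketch of the client side in Python with the requests library. The host name, the use of form fields instead of the path-style syntax above, and the JSON response shape are all my assumptions, not a spec:

import requests

BASE = "https://crawlservice.example.com"  # hypothetical service host

# Ask the service to fetch this page at 00:00 and 12:00, Monday to Friday,
# and to forget about it after 30 days (2592000 seconds).
resp = requests.post(BASE + "/api/i", data={
    "i": "http://www.example.com/file.html",
    "user": "name",
    "pass": "word",
    "when": "00 00,12 * * 1-5",
    "expires": 2592000,
})
crawl_id = resp.json()["id"]  # assumed response shape: {"id": 111}
print("crawl request registered as", crawl_id)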
Of course there’s also:
POST /api/i=regex
format=rss
content_regex=some_string(.*)sucks?
So you know when someone says your product sucks. And:
POST /api/i=regex
name=(jpg,gif)
width=LT200
height=LT200
type=image
LT is Less Than; there would also be GT and EQ.
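As a sketch of what those comparators would mean on the service side, here's how a crawled image could be checked against a rule like the one above. The rule representation and field names are mine, just to show the LT/GT/EQ idea:

# One customer rule, roughly as the service might store it after the POST above.
rule = {"name": ("jpg", "gif"), "width": "LT200", "height": "LT200", "type": "image"}

def matches_comparator(spec, value):
    # spec looks like "LT200", "GT1024" or "EQ64"
    op, threshold = spec[:2], int(spec[2:])
    return {"LT": value < threshold, "GT": value > threshold, "EQ": value == threshold}[op]

def image_matches(rule, image):
    # image is whatever the crawler extracted, e.g. {"name": "logo.gif", "width": 120, "height": 60}
    return (image["name"].rsplit(".", 1)[-1] in rule["name"]
            and matches_comparator(rule["width"], image["width"])
            and matches_comparator(rule["height"], image["height"]))

print(image_matches(rule, {"name": "logo.gif", "width": 120, "height": 60}))  # True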
So, /api/i= inserts a crawling request. You can retrieve webpages with /api/g=:
POST /api/g=http://www.example.com/file.html
only=#some_node_id .some_node_class
“only” takes a selector (XPath, or CSS-style as in the example) so you get back just the nodes you care about.
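Server side, applying “only” could be as simple as running the stored page through the selector before returning it. A rough sketch with lxml (plus the cssselect package), assuming a CSS-style selector; the XPath variant is one line different:

from lxml import html

def apply_only(page_bytes, only_selector):
    # Return just the nodes the customer asked for, concatenated as HTML.
    tree = html.fromstring(page_bytes)
    nodes = tree.cssselect(only_selector)      # e.g. "#some_node_id .some_node_class"
    # XPath variant: nodes = tree.xpath(only_selector)
    return b"".join(html.tostring(node) for node in nodes)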
Since the customer would pay for data transferred, it would suck to have them poll /api/g= every time they need something. And it's not much different from writing your own crawler, is it? Actually it is, because of robots.txt, HTML parsing, server load, and much more. But a lot of people think that writing crawlers is easy and scalable.
Anyway! The magic happens when you crawl a webpage and it matches a rule set by one of your customers. Now you just need to tell them which of the ids previously returned by /api/i= are ready. They connect to your server and download the files.
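How you “tell them” isn't specified above; a webhook is one obvious way. Here's a sketch of the customer side, assuming the service POSTs a JSON list of ready ids to a URL the customer registered (the payload shape and the way the id is passed to /api/g are my guesses):

import requests
from flask import Flask, request

app = Flask(__name__)
BASE = "https://crawlservice.example.com"  # hypothetical service host, as before

@app.route("/crawl-ready", methods=["POST"])
def crawl_ready():
    # Assumed notification payload: {"ready": [111, 112, 113]}
    ready_ids = request.get_json()["ready"]
    for crawl_id in ready_ids:
        page = requests.post(BASE + "/api/g", data={"g": crawl_id})
        handle_page(crawl_id, page.content)
    return "", 204

def handle_page(crawl_id, content):
    # Whatever the customer actually wants to do with the crawled page.
    print(crawl_id, len(content), "bytes")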
And if you have ids:
POST /api/g=111,112,113
compress=True
Which would return the pages for insert requests 111, 112, and 113 in a zip file.
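On the customer side that batch call could look like this; the zip layout (one entry per id) is my guess:

import io
import zipfile
import requests

BASE = "https://crawlservice.example.com"  # hypothetical host again

resp = requests.post(BASE + "/api/g", data={"g": "111,112,113", "compress": True})
archive = zipfile.ZipFile(io.BytesIO(resp.content))
for name in archive.namelist():    # assumed layout: one entry per id, e.g. "111.html"
    page = archive.read(name)
    print(name, len(page), "bytes")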
One more good thing: economy of scale. Everyone needs the newest RSS feeds. You can have dozens of customers requesting the same feed, but you only need to grab it once.
This service would have nothing to do with search, Google, deep web, semantic web, whatever. Just make sure people know when a webpage is updated.
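That deduplication is really just keying fetches by URL instead of by customer. A toy sketch, with made-up URLs and customer names:

import requests

subscriptions = {
    "http://example.com/feed.rss": ["customer_a", "customer_b", "customer_c"],
    "http://blog.example.org/rss": ["customer_b"],
}

def deliver(customer, url, content):
    print("queueing", len(content), "bytes of", url, "for", customer)

def crawl_cycle():
    for url, customers in subscriptions.items():
        content = requests.get(url).content   # one fetch, no matter how many subscribers
        for customer in customers:
            deliver(customer, url, content)

crawl_cycle()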