Looking beyond the ‘Chrome’

Google Chrome is an interesting story for all kinds of reasons. I am particularly interested in the V8 JavaScript engine. There seems to be no shortage these days of different groups writing different VMs. Google in fact has a few by my count: Dalvik, the VM inside the Android platform; V8, the one powering Chrome and presumably also bundled with Android; and whatever VM powers App Engine.  Both V8 and Dalvik have their own peculiarities. Dalvik is not a Java VM: it won’t run Java class files directly. It is register based rather than stack based and has its own file format, DEX, but it is closely tied to the Java language.  V8 looks like it is geared entirely towards executing JavaScript, with hidden class discovery and handling of the super dynamic nature of JS.  Taking all this in, I see one missed opportunity and a quiet validation of the Google structure.

The missed opportunity in Chrome and V8 is to deliver a VM that supports multiple languages.  I would like to see an ecosystem where browser makers embed a VM and provide a standard set of opcodes and foreign function interfaces, allowing it to work with multiple languages.  Why tie the whole developer experience in the browser to JavaScript? Sure, there would be a JavaScript compiler for the VM, but hopefully there would be a Ruby, Python, Lisp or unknown future language compiler as well. In some ways, I see Microsoft attempting this with Silverlight, and it could be argued that Java, with its renewed support for dynamic languages, wants this as well. But without a VM that is open, freely available, and unburdened by historical and licensing baggage, I don’t think those efforts will ever really succeed. Google has a real opportunity to dramatically improve and diversify web development by delivering a VM standard and allowing for multiple languages in the browser. Come on guys!

The validation of the Google organization is subtle, but think about why they would allow separate teams to implement different VMs while solving what appear to be very similar problems. Why write two JIT back ends, two GC implementations, etc.? It is insane. It makes no sense. Also, where is the CTO or Chief Architect, or hell, even the annoying marketing guy saying we are missing a branding opportunity (Dalvik Mobile, Dalvik Enterprise, Dalvik Browser)?  Clearly, this is not a normal company. They have a culture and management structure that are very different from what we have traditionally seen.  Google has set a vision for the company and then enables its employees to go out and execute. It sounds so simple and obvious, but on the other hand it sounds almost unreal. What other company has enabled you by giving you vast computing resources, 20% time, the ability and encouragement to switch teams, the opportunity to generate your own ideas and an internal marketplace for them, etc.? In some happy world view of Google, they enable their employees and give them real choice in much the same way they aim to enable and give choice to their users. Eventually, the marketplace will decide if Android or Chrome are valued products, but early on they are clearly significant technical achievements.

All this said, I am not always a fan of Google and I am sure some of the above is just myth. I think their shortcomings are numerous, but they provide an interesting foil to most other companies operating today. Seriously, where is the plague of Harvard Business School books extolling the “Google Way”? We have seen a few, but nowhere near the number one would expect.  Perhaps it is just too hard to capture in a book, or perhaps the real story actually eviscerates the crowd of people those books normally target.

Going overseas and factoring out the commonality with Python

Back in June, I had an opportunity to attend the BBC Mashed 08 event with some co-workers. We had a great time checking out what is happening across the pond. But I think one of the best parts, besides the beanbag chairs (which were really sweet), was the chance to just hack on some code for 24 hours straight. It was a great break from the standard work cycle, where it often seems uninterrupted coding is a rarity. Just jump in and start hacking. In the end, I built a search index and API of the BBC news content, and others in my party built some super cool apps on top of it. I have been working on various aspects of search for a number of years, but this was a great chance to really focus on a search project within a very short time frame.

One of the key pieces of the hack was getting the content. So I needed to write a fetcher and parser, but I also needed to save time to work on all the other pieces. Grabbing the content was pretty straightforward, but I needed something that could parse the page and pick out the unique bits. I really only wanted the article text and not the navigational/promotional cruft. As a side note, it was interesting to see the languages people chose to work with: roughly, I think it was Python, Ruby, Java and then PHP, but it was very heavily weighted towards Python and Ruby. I am an old time C guy, but for this I decided to use Python. So here is a slightly shortened version of the fetcher/parser that I wrote in Python.

import re, sys, urllib2

clean = re.compile(r'\s+')          # runs of whitespace
tagremove = re.compile(r'<[^>]*>')  # anything that looks like an HTML tag

uniq = {}   # cleaned line -> number of times seen across all pages
items = {}  # url -> cleaned lines of that page
for url in open(sys.argv[1]):
    url = url.strip()  # drop the trailing newline from the URL file
    if not url:
        continue
    lines = [clean.sub(' ', tagremove.sub(' ', x)).strip() for x in urllib2.urlopen(url).read().split('\n')]
    # clean off some cruft from javascript vars - very BBC specific
    lines = [x for x in lines if not x.startswith('var') and not x.startswith('switch(')]
    for l in lines:
        uniq[l] = uniq.get(l, 0) + 1
    items[url] = lines

for url, lines in items.iteritems():
    print '------------------------------'
    print url
    # keep only the lines that appeared exactly once in the whole batch
    print ' '.join([l for l in lines if uniq[l] == 1])

This takes advantage of the templated nature of the BBC’s website. All the navigational elements are replicated on every page. Instead of combing through the HTML and trying to understand what the important portions are, this script does something much simpler: it factors out the commonality for a group of pages. It pulls a bunch of URLs, breaks the documents into lines, performs some cleansing (removing HTML tags and collapsing white space), counts the number of times each unique line appears in the group of documents, and finally loops through each document again and removes lines that occur more than once.
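For anyone who wants to play with the idea without hitting the network, here is a minimal sketch of the same steps as a reusable function. It is not the script from the hack day: it is written for Python 3, it takes pages that have already been fetched, and the names `clean_lines` and `factor_out_common` are made up for illustration. Unlike the script above, it also skips empty lines explicitly.

```python
import re
from collections import Counter

TAG = re.compile(r"<[^>]*>")   # anything that looks like an HTML tag
SPACE = re.compile(r"\s+")     # runs of whitespace


def clean_lines(html):
    """Strip tags and collapse whitespace, one cleaned string per source line."""
    return [SPACE.sub(" ", TAG.sub(" ", line)).strip() for line in html.split("\n")]


def factor_out_common(pages):
    """Map each URL to the text of its lines that occur exactly once in the batch.

    `pages` is a dict of {url: raw_html}; this input shape is an assumption
    made for the sketch so that no network access is needed.
    """
    cleaned = {url: clean_lines(html) for url, html in pages.items()}
    # Count every cleaned line across the whole batch of pages.
    counts = Counter(line for lines in cleaned.values() for line in lines)
    # Keep non-empty lines seen only once, i.e. not part of the shared template.
    return {
        url: " ".join(l for l in lines if l and counts[l] == 1)
        for url, lines in cleaned.items()
    }
```

Given two pages that share a navigation line, such as `{"a": "<div>Home | News</div>\n<p>Cats  win</p>", "b": "<div>Home | News</div>\n<p>Dogs lose</p>"}`, only the article lines survive.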

One important thing is to submit a set of URLs that share similar navigational and supporting structure. I used URLs pulled from the BBC’s RSS feeds, which provided a nice grouping where navigational elements are likely to be very similar. Using the RSS feeds also had the advantage that each group was relatively small. Since everything is stored in memory, handing this script 100K URLs would be problematic. The script needs to be run against small batches of URLs, but nothing prevents those batches from being run in parallel.
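The batching could be sketched roughly like this. Again, this is Python 3 and not something I ran at the event; `chunk` and `run_batches` are hypothetical helper names, and the worker is whatever function fetches and filters one batch. Splitting by fixed size is only a stand-in: in practice one batch per RSS feed is the better grouping, since the pages in a batch need to share a template.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(urls, size):
    """Split a list of URLs into consecutive batches of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def run_batches(batches, worker, max_workers=4):
    """Apply `worker` to every batch on a thread pool, preserving batch order.

    `worker` is a caller-supplied function (hypothetical here) that takes one
    batch of URLs and returns whatever result it produces for that batch.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, batches))
```

Threads fit here because the work is mostly waiting on HTTP responses, and each batch keeps its own line counts, so memory use is bounded by the batch size rather than the total number of URLs.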

I am not sure where I picked up this idea of factoring out the commonality. I have read numerous papers about shingling and various techniques similar to this, but I never had good cause to actually implement something like it before heading over to Mashed 08. It fit with what I needed to do and it was super easy to code up and tweak. Anyway, I think this approach should work for scraping out the text from all kinds of sites.

Ahhh

Blogging for me is like singing in the shower. I am a terrible singer and it is solely for my own enjoyment. So this is my way of saying – I don’t expect much here.