In his post “Could Someone Explain Technorati” Chris Brogan wonders about the consistency, accuracy and reliability of the Technorati service. I can’t explain the behavior of the system over there, but I can share some of my experience dealing with different challenges using online APIs (web services) and data. The objective here is to help other mashupers better prepare for future integration efforts across multiple web services. Since it appears that the mashuper community is growing faster than the web service providers, I’m sure that more fellow API consumers can share some stories of their own. I would be happy to hear them.
I see three participants’ perspectives in this “love triangle”: the web site visitor, the mashuper (the API consumer) and the service provider.
My visitor experience:
Chris Brogan talks about his experience from the user perspective in his post. I have nothing to add here, but I would say that if I were a service provider, satisfying my loyal community would be my top concern. Maybe the way to deal with the case from Chris’s post is to monitor for exceptions (a drastic rise or fall in the rank/authority).
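To make the "monitor for exceptions" idea concrete, here is a minimal sketch in Python. The function name and the 25% threshold are my own assumptions for illustration, not anything Technorati does or documents; the right threshold would have to come from watching real data.

```python
def is_suspicious_change(previous: int, current: int,
                         max_relative_change: float = 0.25) -> bool:
    """Flag a new authority value that moved too far from the previous one.

    The 25% default is an arbitrary, hypothetical threshold.
    """
    if previous <= 0:
        # Nothing sane to compare against, so treat the sample as suspect.
        return True
    return abs(current - previous) / previous > max_relative_change
```

A flagged sample would not be published right away; it would be held until a later reading confirms or contradicts it.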
My mashup experience:
As I mentioned in some of my earlier posts (here, here and here) I’m working on a small project for finding productive bloggers by monitoring for consistent improvements in their Technorati rank. So on a frequent basis I now monitor the rank of over 800 bloggers. I post some of the results to a designated Twitter account: blogmon.
The first set of challenges is dealing with volatile data:
- Sometimes I see no authority in the results (inboundblogs).
- Sometimes there is no valid last update date in the results: <lastupdate>1970-01-01 00:00:00 GMT</lastupdate>
- Most of the time there is no author (the user did not add it)
- Sometimes there are no tags (the user did not add them)
- Sometimes, as Chris mentioned, the rank is off for a short period of time
For example see Seth Godin’s Blog rank history:
Last update   Rank   Authority
2/12/2008     19     8599
2/25/2008     18     8697
3/17/2008     19     8658
3/22/2008     16     8827
4/10/2008     15     8946
4/19/2008     16     8882
4/23/2008     17     8819
5/12/2008     17     8828
5/14/2008     16     8863
5/20/2008     15     8890
These are the details that a consumer of volatile online data must plan for and find ways to compensate for:
- Check the validity of the date.
- Don’t just count on the last result, i.e. search for the last valid result and monitor over time.
- Be prepared to post partial results (e.g. no top tags or author).
- Most important: guard your data, i.e. protect what you take from the service and store in your records.
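The checks above can be sketched roughly like this in Python. This is only an illustration under my own assumptions: each stored sample is a plain dict, the key names ("lastupdate", "authority") and the helper names are hypothetical, and the date format is taken from the <lastupdate> example shown earlier.

```python
from datetime import datetime, timezone
from typing import Optional

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_last_update(text) -> Optional[datetime]:
    """Parse 'YYYY-MM-DD HH:MM:SS GMT'; return None for missing, malformed or epoch dates."""
    try:
        parsed = datetime.strptime(text.strip(), "%Y-%m-%d %H:%M:%S GMT")
    except (ValueError, AttributeError):
        return None
    parsed = parsed.replace(tzinfo=timezone.utc)
    # 1970-01-01 is what the service returns when it has no real date.
    return parsed if parsed > EPOCH else None


def is_valid(sample: dict) -> bool:
    """A sample is usable only with a real date and a positive authority."""
    authority = sample.get("authority")
    return (parse_last_update(sample.get("lastupdate")) is not None
            and isinstance(authority, int)
            and authority > 0)


def last_valid(history: list) -> Optional[dict]:
    """Walk the stored history backwards and return the newest valid sample."""
    for sample in reversed(history):
        if is_valid(sample):
            return sample
    return None
```

The point of keeping the full history is that a bad reading never overwrites a good one: the report always falls back to the most recent sample that passed the checks.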
The next set of challenges has to do with the web service behavior:
- I got the following error once or twice: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.
- Some API requests come back with:
<META HTTP-EQUIV="REFRESH" CONTENT="2; URL=http://api.technorati.com/bloginfo?url=****&key=****&version=0.9&start=1&limit=30&claim=0&highlight=0">
**I intentionally masked the URL, title, image and my developer Key with ****
This result can crash your system if not handled.
- Finally, and I get this one a lot :)
<?xml … "http://api.technorati.com/dtd/tapi-002.xml">
<error>You have used up your daily allotment of Technorati API queries.</error>
- I can’t picture my dev world without Exception Handling – in this specific case it is the ultimate protection against unexpected web service behavior. So guard every call, XML load and data parsing step by wrapping it in a try and catch block.
- Logging – log expected and unexpected behavior for later analysis and recovery.
- Build the system so that exceptions are caught and logged but execution can move on to the next task.
- This is something that I learned from a smart Army officer: “If there is a doubt there is no doubt”, basically saying that it is better not to report at all than to report inaccurate data.
- Find ways to minimize the API calls – e.g. I ask for tags only when I find a blog worth reporting on.
- A thought: I’m not an expert in XML and DTD, but could it be that using a DTD slows down the web service? Is it really necessary on read-only calls? If you know more about it please share with me/us.
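Here is a rough Python sketch of the "catch, log and move on" approach described above. Everything named here is hypothetical (ServiceError, parse_response, run_all, and the caller-supplied fetch function that returns the response body as a string); it only shows the shape of the idea, not the code that actually runs behind blogmon.

```python
import logging
import xml.etree.ElementTree as ET

log = logging.getLogger("blogmon")


class ServiceError(Exception):
    """Raised when a response came back but is not usable data."""


def parse_response(body: str) -> ET.Element:
    """Return the parsed XML root, or raise ServiceError for known bad responses."""
    lowered = body.lower()
    if "http-equiv" in lowered and "refresh" in lowered:
        raise ServiceError("got an HTML refresh page instead of XML")
    try:
        root = ET.fromstring(body)
    except ET.ParseError as exc:
        raise ServiceError(f"unparseable XML: {exc}") from exc
    # An <error> element (e.g. the daily allotment message) means no data.
    error = root if root.tag == "error" else root.find(".//error")
    if error is not None:
        raise ServiceError((error.text or "unknown error").strip())
    return root


def run_all(urls, fetch):
    """Process every URL; log failures and keep going to the next task."""
    results, failures = {}, []
    for url in urls:
        try:
            results[url] = parse_response(fetch(url))
        except ServiceError as exc:
            log.warning("skipping %s: %s", url, exc)
            failures.append(url)
        except Exception:
            # Dropped connections and anything else unexpected land here.
            log.exception("unexpected failure for %s", url)
            failures.append(url)
    return results, failures
```

Blogs that end up in the failures list are simply not reported on in that run, in the spirit of "if there is a doubt there is no doubt"; they can be retried later from the log.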
About the service:
I can’t talk much about what a web service provider feels or experiences (I’m sure that Ian Kallen from Technorati has a lot to share about this subject) but I want to say a few things:
- Please don’t get this post wrong. I’m a fan of Technorati – I use it, deeply appreciate their service and am thankful for having the option of using the APIs. As I said earlier, the intention is to share from experience and to allow you to better prepare for such an effort.
- I guess that it is hard to estimate the load on the system with such growth in the number of mashupers out there. So my heart is with them.
- There are two more threats that the web service provider needs to protect itself from, and I’m sure that those consume some energy: abuse of, and malicious attacks on, its hard-gathered data and its environment.
One last comment: ironically I have had no problems with Twitter so far :) but I’m aware of the pain that some of the Twitter API users suffer occasionally.
Things are going real-time. Social networks, “decentralize me” and the semantic web are all great and we will see more of those coming, but the most disruptive change that is already happening is that the web brings us all together in real-time. It allows us to read/write the web at any time from anywhere.
There are a few catalysts for this change. The first is mobile technology, including the devices, integrated GPS and the bandwidth; the second is that we are getting used to being always connected; and the third is all the good stuff that is happening on the web.
Now, I’m not talking about the future. I’m talking about the here and now, and a little later.
Twitter made the connection between SMS and the web like no other tool before. There is nothing more real-time than Twitter. News, gossip, agenda, help, you name it.
Brightkite – Location-based social networking – in real-time.
What is stopping us?
I think that the biggest obstacle to mass adoption, at this point, is the cost of the data plan. If that cost goes down we can all enjoy:
Read: constant mobile feed aggregation delivering location-based, timely and profile-based information.
Write: sharing our current experience using text, audio and video.
So what will happen a little later?
- Creating real-time social events
- Getting recommendations, tailored to our profile, as we approach a new location
- Knowing who’s in the area (that could be annoying too – it requires some configuration about who you want to know that you are around).
- Car pooling
- Checking for bargains in the area (pull – not push)
- Possible: Instant check-out – just go through the door with the purchased items and check the bill (one at a time) – a combination of RFID and the mobile device (beware if someone steals your phone).
- Mobile Peer-to-Peer – sharing the information among us off the server
Can you see more ways that WebRT can enrich your experience in real-time?
Update 5-12-08: I guess that my timing was good :) RIM introduced today the BlackBerry Bold Smartphone including GPS, Wi-Fi® support and a video camera.
Here is some of my experience dealing with lots of information. How do you do that?
I use Netvibes Ginger to consume most feeds from blogs that I like reading. It helps me with my constant battle on another “Inbox 0” front in my personal email account. By copying blog feed URLs to Ginger I avoid more emails to clear from my already overloaded inbox.
I have multiple tabs set for grouping my feeds by categories. I don’t visit every tab every day. Some I rarely visit.
There are two blogs I like and respect enough that I actually subscribed to their feeds by email: TechCrunch and ReadWriteWeb. This way I’m sure to get two daily doses of what is going on in the Internet world.
These two have some similarities but they do not overlap, and reading both keeps me in the know.
I actually just finished reading “The Stats Are In: You’re Just Skimming This Article” on ReadWriteWeb, which inspired this post.
That post talks about the way people read, or more correctly skim, the web. It quotes stats from Jakob Nielsen, relying on a research study done by Harald Weinreich, Hartmut Obendorf, Eelco Herder, and Matthias Mayer to support this hypothesis.
“What Nielson found by analyzing the data in the study was that although people spend more time on pages with more words and more information, they only spend 4.4 seconds more for each additional 100 words. By calculating reading rates, he concluded that when you add more verbiage to a page, people will only read 18% of it.
Some other interesting findings include:
- On an average visit, users read half the information only on those pages with 111 words or less.
- People spend some of their time understanding the page layout and navigation features, as well as looking at the images. People don’t read during every single second of a page visit.
- On average, users will have time to read 28% of the words if they devote all of their time to reading. More realistically, users will read about 20% of the text on the average page….”
So, how do busy people like us really read the web?
How did you change your reading habits due to the following conditions:
- distracted by any means of communication, devices and applications (phone, cellphone, twitter, multiple IM accounts, multiple email accounts, skype, and more) – one option is to print it and take 2 steps away from the computer
- distracted by images, navigation, ads within the blog – this is partially avoidable when using a feed aggregator
- bombarded with information – see below: Building a personal associative tag cloud
Here is how I’m trying to deal with too much information:
Building a personal associative tag cloud
As I explained at the beginning of this post, I don’t read it all every day. I just can’t. I skim a lot. I read a lot of headlines trying to build “hooks”. I’m filling my brain with links so the next time I see something related I’ll stop for a little longer. I make my head familiar with words, names and terms. It is as if I’m building a tag cloud inside my brain, and the more frequently a tag appears the bigger the “font” is or, in my analogy, the sensitivity to a word, name or term. For instance, reading TechCrunch and ReadWriteWeb over time helps me build an associative tag cloud of company names, technologies and buzzwords. I’m aware that the difference between a “real” tag cloud and the one I shape in my head is that it is not just the frequency that counts but also other factors like personal interest and preferences. It is also strongly driven by my intentions. Yet, every time I skim these two newsletters I feel that I get more out of them. So, maybe it is no longer skimming but more like “nesting”: filing information and more tags inside the associative cloud. I do fully read some posts and comments :)
So how do you absorb information? How do you deal with all the noise and distraction?