Interview for… Journalists get to know the Semantic Web!

I was interviewed last week by Colin Meek from on the topic of “Web 3.0” and what it means for journalists… You can read the full article in two parts (1, 2). My original answers are part of an interview on their Insite blog. I also had the chance to talk about various DERI offerings in the Semantic Web area including SIOC, SWSE, Sindice, Semantic Radar, etc.

Colin also asked me about other readable data that is being crawled by Semantic Web search engines like Sindice, SWSE or Swoogle. These search engines can usually match keywords in any data that has been crawled or integrated into a semantic store, not just data about people. It could be structured information about people, places, dates, library documents, blog items or topics, whatever. In fact, there is no limit to the types of things that can be indexed and searched, since RDF (an open data model that can be adapted to describe pretty much anything) is used as the data format. Anyone can reuse existing RDF vocabularies like SIOC to publish data, or they can publish data using their own custom vocabularies (e.g. to describe stamp collecting or Bollywood movie genres), or they can combine public and custom vocabularies (e.g. take FOAF and your own vocabulary about soccer to describe players and managers on a soccer team). Geotemporal information is particularly useful across a range of domains, and provides nice semantic linkages between things. For example, having geographic information and time information is useful for describing where people have been and when, for detailing historical events or TV shows, for timetabling and scheduling of events, etc., and for connecting all of these things together (“I’m travelling to Edinburgh next week: show me all the TV shows of relevance and any upcoming events I should be aware of according to my interests…”).
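To make the idea of combining a public vocabulary with a custom one a bit more concrete, here is a minimal sketch in Python that describes a soccer player using FOAF plus a made-up soccer vocabulary, and writes the result out as N-Triples. The FOAF and RDF namespaces are the real ones; the `http://example.org/...` namespaces, the `playsPosition` property, the helper functions and the player are all hypothetical and only there for illustration.

```python
# Real, published namespaces.
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical namespaces: a custom soccer vocabulary and a team's data.
SOCCER = "http://example.org/soccer#"
TEAM = "http://example.org/team/"


def uri(value):
    """Wrap a URI in angle brackets, as N-Triples expects."""
    return f"<{value}>"


def literal(value):
    """Quote a plain string literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def describe_player(slug, name, position):
    """Return (subject, predicate, object) triples mixing FOAF and SOCCER."""
    subject = uri(TEAM + slug)
    return [
        (subject, uri(RDF_TYPE), uri(FOAF + "Person")),                # public vocabulary
        (subject, uri(FOAF + "name"), literal(name)),                  # public vocabulary
        (subject, uri(SOCCER + "playsPosition"), literal(position)),   # custom vocabulary
    ]


def to_ntriples(triples):
    """Serialise triples one per line, each terminated by ' .'."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


triples = describe_player("jane-doe", "Jane Doe", "goalkeeper")
ntriples = to_ntriples(triples)
print(ntriples)
```

A crawler that already understands FOAF can pick up the name from this data without knowing anything about the soccer vocabulary, which is the practical benefit of mixing the two.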

The keyword searches in the Sindice search engine allow you to find more information on where resources of interest are (searching for “john breslin” will point to all public pages that contain semantic information about yours truly). Sindice also has an API that can provide results in a reusable (semantic) format that can be leveraged by other applications. Alternatively, SWSE (Semantic Web Search Engine) shows you semantic information about the object of interest (e.g. my phone number, my friends, etc.), which may be derived from multiple sources (this information on me comes from tens of sources consolidated together via unique identifiers for me, or through what’s called “object consolidation”).
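As a rough illustration of what consolidating descriptions via unique identifiers can look like, the sketch below merges records from different sources whenever they share the value of an identifying property. This is not SWSE’s actual implementation: the choice of identifying keys (modelled loosely on FOAF’s `mbox_sha1sum` and `homepage` properties), the record layout and the sample data are all assumptions made for the example.

```python
from collections import defaultdict

# Assumption: properties whose values are treated as uniquely identifying.
IDENTIFYING_KEYS = ("mbox_sha1sum", "homepage")


def consolidate(records):
    """Merge records that share a value for any identifying key."""
    parent = list(range(len(records)))  # union-find forest over record indexes

    def find(i):
        # Follow parent links to the group's root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    # Link every record to the first record seen with the same identifier.
    first_seen = {}
    for index, record in enumerate(records):
        for key in IDENTIFYING_KEYS:
            if key in record:
                marker = (key, record[key])
                if marker in first_seen:
                    union(index, first_seen[marker])
                else:
                    first_seen[marker] = index

    # Pool all property values of each group into one consolidated entity.
    merged = defaultdict(lambda: defaultdict(set))
    for index, record in enumerate(records):
        root = find(index)
        for key, value in record.items():
            merged[root][key].add(value)
    return [dict(entity) for entity in merged.values()]


# Hypothetical sample data from four sources.
records = [
    {"source": "blog", "mbox_sha1sum": "abc123", "name": "J. Example"},
    {"source": "forum", "mbox_sha1sum": "abc123", "homepage": "http://example.org/~j"},
    {"source": "wiki", "homepage": "http://example.org/~j", "phone": "555-0100"},
    {"source": "other", "mbox_sha1sum": "zzz999", "name": "Someone Else"},
]
entities = consolidate(records)
print(len(entities))
```

Note that the “wiki” record never mentions the e-mail hash, yet it still ends up merged with the “blog” record because the “forum” record links the two identifiers. That transitive linking is also why this kind of merging can surface more about a person than any single source reveals.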

For me, this article highlighted the fact that the Semantic Web community needs to be very aware that one of the key features of the Social Web, for journalists and for many others, is the ability to find a lot of personal and sensitive information on people. With the advent of “Web 3.0”, we need to realise (“with great power comes great responsibility”) that the availability of contextual and semantically-related information is going to become even more apparent, and people will talk about it in both positive and negative terms. Site owners need to be educated about what semantic data they may be publishing (knowingly or unknowingly, even if it’s just RSS feeds), and developers should determine exactly what opt-in or opt-out mechanisms are required before implementing semantic solutions. Users should also be aware of the benefits and other potential uses of their semantic data.

I think now is the time to avert any scares, because in reality, the data that is on the Web or the Social Web can be used in new ways anyway, whether metadata is present or not (some facts can be derived). Google have recently implemented some discussion forum parsing algorithms to determine how many posts are on a thread, how many users posted on that thread and when the last post was made. You can see this in a search result I did for “irish pubs” below. It’s not complete, and probably relies on identifying certain HTML structures for non-Google discussion sites, e.g. you can see two threads in the middle that don’t show details of the total posts or commenters. But it’s moving towards the SIOC vision of providing more metadata about discussions on the Web to help you in finding more relevant information – whether the site owners want to provide Semantic Web data or not!

Making data available semantically enables computers to help us do things we cannot easily do (or cannot do at all) right now, and this is what makes it so powerful. We also need to think more towards educating people about the benefits as well as how we can minimise any hazards. Is this a job for W3C SWEO? As my colleague James Cooley said: “I think scientists thought the benefits of GM food were so obvious that there was no case to make. Then you got Frankenstein Food and the game was up.”

For journalists interested in the Semantic Web, I’d recommend reading this paper entitled “SemWebbing the London Gazette” by Jeni Tennison and John Sheridan, which describes how they have exposed information from their newspaper website using RDFa so that it becomes easy to re-use (slides here). You can also view some interesting slides by Colin Meek from a seminar he gave to journalists about the Social Web in Oslo a few days ago. It’s in three parts (1, 2, 3). I’ve embedded the third part (on the Semantic Web) below…


3 thoughts on “Interview for… Journalists get to know the Semantic Web!”

  1. John – very interesting post. I had the pleasure of discussing these issues with a range of other journalists at the ‘Journalists and the Social Web’ seminar in Oslo at the weekend.

    I actually made four presentations at the seminar with the first setting the context about privacy issues and journalists (I haven’t yet made that available). I ventured that there are three categories of personal information on the internet: the information people intentionally publish; information about themselves they have no control over; and, lastly, information you make available to specific sites under certain conditions.

    My feeling is, there is a great deal of confusion about the latter two and the divisions between these categories are becoming blurred. I agree with you, that the semantic web community needs to grasp this nettle.

    I think there is an absolute distinction between what people INTENTIONALLY PUBLISH and what they make ACCESSIBLE. It is my view that journalists must be aware of this distinction: information on a profile may be accessible, but the writer may not have wanted the information published outside of his or her friends or the bebo community. Equally, I’d argue, that most people in LiveJournal do not expect their profile information to be exportable in FOAF files.

    At we have researched ways to access information on social networks and, as a result, have written to the Press Complaints Commission in the UK to highlight how social network privacy controls are failing people who join these networks. Given web3.0 developments, there may be even more scope for exploiting tools to pinpoint personal information.

    At the seminar I argued that because technology means these issues are in continual flux, journalists can only respond by debating the issues as a community and being professionally aware of the minefield. I will be outlining my introductory talk at the seminar soon in my Insite blog on

    I think the sooner the semantic web community focuses on the privacy issues the less risk there will be of ‘privacy scares’ related to web3.0.

    By the way. Thanks again for your time and patience with the interview for my features. Very much appreciated. I’d really like to stay in touch on these issues.
