Roma / Italy
Planning
Some related user stories
http://wiki.iks-project.eu/index.php/User-stories
Feel free to add new user stories to complete the use cases.
1st participant: DERI
Thursday November 12, 2009: 17h20 – 17h40
Fri, 03 Jul 2009 from Stephane Corlosquet <stephane.corlosquet at deri dot org>
Below is the architecture that DERI would like to suggest for the IKS Semantic Search
Engine. The figure [1] shows a set of CMS sites complying with the best practices of RDF
data publishing, which include RDFa, a local schema export (site vocabulary), and a
SPARQL endpoint. We have worked on a set of modules for Drupal, detailed in a
technical report at [2], but their features could be generalized to other CMSs. The sites
can request to be included in the IKS search engine via a form on the IKS search engine
site or programmatically via a ping. Pings are also used in the case where a specific
resource/page has been updated on a given site in order for the search engine to schedule
a recrawl of the resource as soon as possible.
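As a minimal illustration of such a ping, assuming a hypothetical /ping endpoint on the
IKS search engine site (the endpoint path and parameter name are invented for this
sketch, not part of the proposal):

  import urllib.request
  import urllib.parse

  def ping_search_engine(updated_resource_url):
      # Notify the (hypothetical) search engine that a resource changed,
      # so it can schedule a recrawl of that resource soon.
      query = urllib.parse.urlencode({"url": updated_resource_url})
      ping_url = "http://search.example.org/ping?" + query
      with urllib.request.urlopen(ping_url) as response:
          return response.status == 200

  ping_search_engine("http://cms.example.org/node/42")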
The semantic search engine stack is composed of several layers of data gathering,
parsing, validation and indexing. The search engine first gathers the data by crawling the
sites; it then parses the RDF data with the any23 parser [3], a Java library that extracts
structured data in RDF format from a variety of Web documents (it supports microformats,
RDFa and other common RDF serialization formats). If needed, the NxParser [4] cleans
up the data and formats it as N-Quads [5]. Before a site can be included in the IKS search
engine, it first goes through the RDFAlerts validator, which ensures that the RDF data
contained in the sites complies with the RDF publishing best practices. RDFAlerts also
does some RDF consistency checking. Additionally, other IKS-specific policies regarding
the sites included in the search engine could be added here. Finally, the SWSE engine [6]
takes care of the indexing and storage of the data. Powered by YARS2, it provides
distributed storage and retrieval facilities. Indexing structures are optimized for retrieval
of RDF statements including context (quads) while minimizing the need for joins, plus
Lucene fulltext indexing for efficient keyword searches. SWSE's SPARQL endpoint
makes it possible to plug in any RDF visualization tool, e.g. VisiNav [7]. See the
screencast at [8] (1'36'') for the possibilities offered by VisiNav. A schematic sketch of
this pipeline follows after the references.
[1] http://srvgal65.deri.ie/files/iks_search_engine_cloud.pdf
[2] http://www.deri.ie/fileadmin/documents/DERI-TR-2009-04-30.pdf
[3] http://code.google.com/p/any23/
[4] http://sw.deri.org/2006/08/nxparser/
[5] http://sw.deri.org/2008/07/n-quads/
[6] http://www.swse.org/
[7] http://visinav.deri.org/
[8] http://www.youtube.com/watch?v=r4WgTRIRoa0
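To make the crawl-parse-clean-validate-index flow concrete, here is a schematic sketch.
any23, NxParser, RDFAlerts and SWSE are Java components, so the functions below are
hypothetical stand-ins that only mirror the stages described above, not real bindings:

  from dataclasses import dataclass

  @dataclass
  class Report:
      ok: bool
      messages: list

  def crawl(site_url):
      # Would fetch pages from the CMS site; here a single dummy page.
      return [f"{site_url}/node/1"]

  def extract_rdf(page_url):
      # any23 stage: extract triples from RDFa/microformats in the page.
      return [("ex:doc1", "dc:title", '"Hello"')]

  def clean_to_nquads(triples, context):
      # NxParser stage: normalize to N-Quads (triple plus context).
      return [t + (context,) for t in triples]

  def validate(quads):
      # RDFAlerts stage: best-practice and consistency checks; IKS-specific
      # admission policies could hook in here.
      return Report(ok=len(quads) > 0, messages=[])

  def swse_index(quads):
      # SWSE/YARS2 stage: distributed quad storage plus Lucene fulltext index.
      print(f"indexed {len(quads)} quads")

  def index_site(site_url):
      # Data flow of the proposed stack: crawl -> parse -> clean -> validate -> index.
      quads = []
      for page in crawl(site_url):
          quads += clean_to_nquads(extract_rdf(page), context=page)
      report = validate(quads)
      if report.ok:
          swse_index(quads)
      return report

  index_site("http://cms.example.org")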
Mon, 05 Oct 2009 - From: Axel Polleres <axel.polleres at deri dot org>
We have had to sort things out, since the main developer of the SWSE search engine and
architecture, Andreas Harth, moved to AIFB at the University of Karlsruhe, and in the
course of the move we had some delays in answering the question of what setup we could provide.
Additional notes:
- The update frequency of the index mainly depends on the number of statements we
have to parse, clean and process.
We'd hope that is sufficient for the current project needs; if not, please let us know in
what ranges your requirements would be. Without additional resources we are not
capable of offering a more advanced setup of SWSE/YARS2 in the short term (this could
include a distributed index build, distributed YARS2 instances, distributed SPARQL
processing, or reasoning [4] on the crawled data), but we'd suggest getting things going
small and then seeing where we get from there.
Such a setup could be the starting point for a semantic search engine for IKS and, on top
of that, demonstrate the feasibility of a federated CMS infrastructure as we sketch it in [3,
Section 5.2]. We'd be very excited about getting this going in collaboration with IKS
and then exploring further opportunities jointly!
Best,
Axel, Juergen, Aidan
[1] http://lists.iks-project.eu/pipermail/iks-community/2009-July/000028.html
[2] http://swse.deri.org/
[3] Stéphane Corlosquet, Renaud Delbru, Tim Clark, Axel Polleres, and Stefan Decker.
Produce and consume linked data with Drupal! In Proceedings of the 8th International
Semantic Web Conference (ISWC 2009), Lecture Notes in Computer Science,
Washington DC, USA, October 2009. Available at
http://www.polleres.net/publications/corl-etal-2009iswc.pdf
[4] Aidan Hogan, Andreas Harth, and Axel Polleres. Scalable authoritative OWL reasoning
for the Web. International Journal on Semantic Web and Information Systems, 5(2), 2009.
Available at http://www.deri.ie/fileadmin/documents/DERI-TR-2009-04-21.pdf
Wed, 4 Nov 2009 From: Axel Polleres <axel.polleres at deri dot org>
I will present (together with Juergen Umbrich, who shall send a separate mail) our ideas
on semantic search over networked, RDF-enabled Drupal sites [1].
Our approach is to regularly crawl and index those sites with a specialised instance of our
home-grown semantic search engine SWSE [2], which offers not only a search interface
but also a SPARQL endpoint that lets you query across those sites (a query sketch follows
below). Additionally, if you want specific, current site information, our Drupal modules
enable separate live SPARQL endpoints locally on the sites. See also the architecture that
Stéphane posted earlier on this list [3].
Axel Polleres
1. Stéphane Corlosquet, Renaud Delbru, Tim Clark, Axel Polleres, and Stefan Decker.
Produce and consume linked data with Drupal! In Proceedings of the 8th International
Semantic Web Conference (ISWC 2009), Lecture Notes in Computer Science,
Washington DC, USA, October 2009. Springer. Best paper award, In-Use track.
http://www.polleres.net/publications/corl-etal-2009iswc.pdf
2. http://swse.deri.org/
3. http://www.interactive-knowledge.org/content/iks-search-engine-proposal
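As a concrete illustration of querying such a SPARQL endpoint over HTTP: the standard
SPARQL protocol accepts the query as a URL parameter, so any HTTP client will do. The
endpoint URL and vocabulary below are illustrative, not the deployed setup:

  import urllib.request
  import urllib.parse

  ENDPOINT = "http://swse.example.org/sparql"   # placeholder endpoint URL

  query = """
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?person ?name WHERE {
    ?person a foaf:Person ; foaf:name ?name .
  } LIMIT 10
  """

  url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
  request = urllib.request.Request(
      url, headers={"Accept": "application/sparql-results+json"})
  with urllib.request.urlopen(request) as response:
      print(response.read().decode("utf-8"))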
2nd participant: Trialox
Thursday November 12, 2009: 17h45 – 18h05
Mon, 19 Oct 2009 From: Reto Bachmann-Gmür <reto.bachmann at trialox dot org>
Some of you already met me at the IKS requirements meeting in Salzburg; I'm looking
forward to meeting you again, and more of you, next month in Rome.
For those I haven't met yet, I'll quickly introduce myself here.
I'm working with trialox [1], a startup founded last year at the University of Zurich. We're
working on open source software that makes it easy to develop semantic-web-enabled
applications. Our system is based on OSGi technologies and supports various RDF stores
as backends. The principal supported languages are Java and Scala. As we are near the end
of a major sprint, I'll be posting more information on this software foundation very soon.
Building on this foundation, we're also building a web content management system
leveraging semantic web technologies, especially for the benefit of not-for-profit
organizations. We are working together with the WWF [2] to build a system that allows
better access to their vast and distributed content, both on their public website and in
their internal information infrastructure.
All our products are open source, and we're looking to build a community around the open
source projects, as well as to find business partners we could help implement semantic
solutions for their customers.
So that's what I've been working on for a bit more than a year now. Before that, I worked
in England for Talis and for HP Laboratories. At HP Labs I worked with the Jena team
and implemented a system for versioning RDF graphs as well as tracking their provenance.
My passion (or is it addiction?) for the semantic web dates back to 2002. I started by
implementing the Annotea protocol as a decentralized exchange system and continued the
idea of decentralized, trust- and relevance-based information exchange with the KnoBot
open source project.
Cheers,
reto
1. http://trialox.org/
2. http://www.panda.org/
3rd participant: KiWi
Thursday November 12, 2009: 18h10 – 18h30
Mon, 12 Oct 2009 From: Rolf Sint <rolf.sint at salzburgresearch dot at>
My name is Rolf Sint and I am a researcher and developer at Salzburg Research. I studied
Computer Science and Management at the University of Salzburg. Currently I work on
the EU-funded project KiWi (http://www.kiwi-project.eu/) and I will present the
semantic search functionality of KiWi at the next IKS workshop in Rome.
The KiWi system aims to break system boundaries in that it serves as a platform for
implementing and integrating many different kinds of social software services, and it
intends to break information boundaries by allowing users to connect content and to
connect with each other in new ways. KiWi is a software platform that allows users to
share and integrate knowledge more easily, naturally and tightly, and to adapt content and
functionality to their personal requirements. In KiWi, navigation and search of content is
a key issue and is realized in several ways. One way to navigate within KiWi is a very
flexible faceted search, which allows dynamic configuration of the search facets (see the
sketch below). Please find some screenshots of the current KiWi system attached.
Best regards
Rolf Sint
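As an illustration of dynamically configurable facets, here is a toy sketch (not KiWi's
actual API): content items carry metadata fields, and the set of facets to display is itself
just data, so it can be reconfigured at runtime:

  from collections import Counter

  items = [
      {"type": "wiki page", "author": "anna", "tag": "semantic web"},
      {"type": "blog post", "author": "rolf", "tag": "search"},
      {"type": "wiki page", "author": "rolf", "tag": "search"},
  ]

  def facet_counts(items, facet_fields):
      # For each configured facet field, count the values occurring in items.
      return {field: Counter(item[field] for item in items if field in item)
              for field in facet_fields}

  def drill_down(items, field, value):
      # Narrow the result set by one facet selection.
      return [item for item in items if item.get(field) == value]

  print(facet_counts(items, ["type", "author"]))   # facet set is configurable
  print(facet_counts(drill_down(items, "tag", "search"), ["type"]))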
4th participant: Yahoo! Research
Friday November 13, 2009: 10h00 – 10h20
Fri, 25 Sep 2009 From: Peter Mika <pmika at yahoo-inc dot com>
Hi All,
My name is Peter Mika, and I work as a researcher and data architect at Yahoo!, based in
Barcelona, Spain. Our research lab is part of Yahoo! Research [1] and was established in
2006. We cover a wide range of topics, including multimedia, distributed systems, data
mining (in particular, web mining), and NLP.
On the product side, I'm working as a data architect on knowledge representation
questions related to how we consume and use metadata inside Yahoo!. As an example,
many of you might have heard of SearchMonkey, which allows site owners and developers
to create applications that change the way search results are presented, by using metadata
associated with those pages [2]. I'm also doing part of the evangelism, talking to our
communities of developers and publishers, which gives me a fair bit of understanding of
how people relate to semantic technologies 'in the wild'.
Best,
Peter
[1] http://research.yahoo.com
[2] http://developer.search.yahoo.com/start
5th participant: salsaDev
Friday November 13, 2009: 10h25 – 10h45
Wed, 23 Sep 2009 from Stephane Gamard <stephane.gamard at salsadev dot com>
salsaDev uses a technology that emerged from language acquisition research at Rensselaer
Polytechnic Institute to index textual information at a conceptual level. Our approach to
information access is not a replacement solution but a high-value-added feature:
knowledge workers are provided with sense-centric, meaning-aware access to their
relevant content.
A very pragmatic and typical use case: an IP lawyer, while filing for a patent, must read,
evaluate and sift through a tremendous amount of non-relevant information (too often
also outside the scope of his own area of expertise). A sense-based system such as
salsaDev's can read the patent application and provide meaning-based related information
that might be of interest.
This is salsaDev in a nutshell (an extended one, I am aware). I am sure this short
presentation raises more questions than it answers, so please feel free to send me any
questions you might have. In the meantime, and in preparation for the next workshop, I
wish you all a very semantic day.
Cheers,
Stephane
6th participant: Scribo / Nuxeo
Friday November 13, 2009: 10h50 – 11h10
Tue, 20 Oct 2009, From: Olivier Grisel <ogrisel at nuxeo dot com>
Dear all,
As part of the Scribo project [1], we are working on integrating semantic knowledge
extractors to semi-automatically enrich the knowledge base with named entities and
semantic relationships found in unstructured text content, using UIMA components. We
plan to integrate a CRF-based named-entity extractor trained on multilingual corpora
such as Wikipedia. CRFs (Conditional Random Fields) are a class of machine learning
models for labeling token sequences in natural language processing.
The same kind of semantic hashing algorithm should also work on textual content [6]
described with sparse TF-IDF vectors; a sketch of the idea follows after the references.
A preliminary backlog of semantic-related features for the Nuxeo platform can be found
in our Jira instance [7].
[1] http://www.nuxeo.com/en
[2] http://www.scribo.ws/
[3] http://wiki.iks-project.eu/index.php/User-stories#Story_03:_Similarity-based_Image_Search
[4] http://code.oliviergrisel.name/pyleargist/src/tip/README.txt
[5] http://code.oliviergrisel.name/libsgd/src/9f3f374becc8/examples/semantic_hashing.py
[6] http://wiki.iks-project.eu/index.php/User-stories#Story_09:_Similarity_based_document_search
[7] http://jira.nuxeo.org/secure/IssueNavigator.jspa?reset=true&pid=10273&status=1
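Here is a sketch of hashing TF-IDF document vectors into short binary codes so that
similar documents get nearby codes. True semantic hashing uses learned codes; the
random hyperplane projection below is a simplified stand-in that illustrates the same use
of sparse TF-IDF input (corpus and parameters are invented for the example):

  import numpy as np
  from sklearn.feature_extraction.text import TfidfVectorizer

  docs = [
      "semantic search over content repositories",
      "semantic search over content in a CMS",
      "recipe for chocolate cake",
  ]

  tfidf = TfidfVectorizer().fit_transform(docs)        # sparse doc-term matrix
  rng = np.random.default_rng(0)
  planes = rng.standard_normal((tfidf.shape[1], 16))   # 16 random hyperplanes
  codes = (tfidf @ planes) > 0                         # 16-bit code per document

  def hamming(a, b):
      return int(np.sum(a != b))

  # Documents 0 and 1 will usually get closer codes than documents 0 and 2.
  print(hamming(codes[0], codes[1]), hamming(codes[0], codes[2]))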
Wed, 21 Oct 2009 From: Olivier Grisel <ogrisel at nuxeo dot com>
Just to make it more explicit: for the demo session I should be able to showcase the
current state of the Scribo project, which mainly focuses on IKS user story #5, and a
prototype of similarity search in pictures (IKS user story #3).
http://wiki.iks-project.eu/index.php/User-stories#Story_05:_Assistance_with_Semantic_Tagging
http://wiki.iks-project.eu/index.php/User-stories#Story_03:_Similarity-based_Image_Search
Best,
Olivier
7th participant: Zemanta
Friday November 13, 2009: 11h15 – 11h35
Thu, 15 Oct 2009 From: "Tomaž Šolc" <tomaz.solc at zemanta dot com>
Hi everyone!
Zemanta's content suggestion system is our company's main product: it takes a fragment
of plain text as its input and provides images and articles related to the topic of the text,
as well as relevant tags and automatic explanatory in-text links. It achieves that by first
annotating the text with several components (such as named entity extraction, word sense
disambiguation and classification) and then using the annotated text to search through
collections of similarly annotated objects. This system can be used as an assistant for
bloggers and other authors: suggestions can be applied, either automatically or manually,
to enrich news articles and blog posts.
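As an illustrative sketch of this annotate-then-search idea (not Zemanta's actual
pipeline): the "annotator" below is a toy gazetteer lookup standing in for real NER/WSD
components, and a plain inverted index stands in for Lucene:

  from collections import defaultdict

  GAZETTEER = {"paris": "dbpedia:Paris", "louvre": "dbpedia:Louvre"}

  def annotate(text):
      # Return the entity annotations found in the text (toy gazetteer lookup).
      tokens = (tok.strip(".,!?") for tok in text.lower().split())
      return {GAZETTEER[tok] for tok in tokens if tok in GAZETTEER}

  collection = {
      "article-1": "Visiting the Louvre on a rainy day",
      "article-2": "Paris travel tips",
      "article-3": "Baking sourdough bread",
  }

  # Index each item under its annotations rather than its raw tokens.
  index = defaultdict(set)
  for doc_id, text in collection.items():
      for entity in annotate(text):
          index[entity].add(doc_id)

  def suggest(draft):
      # Suggest related items whose annotations overlap with the draft's.
      related = set()
      for entity in annotate(draft):
          related |= index[entity]
      return sorted(related)

  print(suggest("A weekend in Paris, starting at the Louvre"))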
At the demo session of the next IKS workshop I would like to show a live demo of our
system [1] and explain a little of what happens behind the curtains: what exactly the
annotations look like, how our word sense disambiguation works, and how we use open-
source solutions like Lucene to search large collections of documents.
[1] http://www.zemanta.com/demo
Best regards
Tomaž
--
Tomaž Šolc, Research & Development
Zemanta Ltd, London, Ljubljana
www.zemanta.com
mail: tomaz at zemanta dot com
blog: http://www.tablix.org/~avian/blog
8th participant: Trezorix
Tue, 3 Nov 2009 from: "Sander van der Meulen" <sander at trezorix dot nl>
Hi All,
My name is Sander van der Meulen and I am Technical Manager at Trezorix. Trezorix
was founded in 2000 and is located in Delft, The Netherlands.
Our main software product is the RNA Toolset, an innovative semantic-web-based
toolset for working with content, metadata and reference structures. The goal of the RNA
Toolset is to create an open environment in which knowledge workers can create and
edit their content and publish it to a semantically rich search environment.
The roadmap for the development of the RNA Toolset points to implementing a federated
Sesame/OWLIM RDF layer with RDFS and OWL support as the search platform.
Currently we only have RDF configurations in our test environments. In our production
environments we've successfully implemented Solr as the search platform, providing
superb free-text and facet searching. But the lack of relational constructs and inferencing
capabilities in Solr forces us to move to the richer RDF environment for more complex
knowledge systems.
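For illustration, a typical Solr request of the kind described above (URL, field names and
query value are placeholders for a concrete deployment): free text plus facet counts in
one round trip. What Solr cannot do is follow, say, an rdfs:subClassOf or skos:broader
chain between facet values; that is what the Sesame/OWLIM layer would add:

  import urllib.request
  import urllib.parse

  params = urllib.parse.urlencode({
      "q": "text:heritage",      # free-text query
      "facet": "true",
      "facet.field": "subject",  # field to compute facet counts on
      "wt": "json",
  })
  with urllib.request.urlopen("http://localhost:8983/solr/select?" + params) as resp:
      print(resp.read().decode("utf-8"))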
Best regards,
Sander
9th participant: Sourcesense
Friday November 13, 2009: 12h05 – 12h25
Fri, 23 Oct 2009 From: Tommaso Teofili <tommaso.teofili at gmail dot com>
Hi all,
my name is Tommaso Teofili and I am new to IKS. I am a software engineer at
Sourcesense [1], a European company specializing in the integration of open source
projects. At Sourcesense we strongly believe in open source, and everyone in the
company is encouraged to contribute to the projects they work on. Many of us contribute
and commit to open source projects like Infinispan, JBoss Portal, Alfresco, Apache POI,
Apache Chemistry, Scarlet, WURFL and others [2].
Before joining Sourcesense I started studying, using and then contributing to Apache
UIMA [3] for my graduation thesis (since November 2008); in August 2009 I became a
committer.
At the moment the project is on its way towards the 2.3.0 release and may become an
Apache TLP [4]. During this period I built some prototype applications using UIMA for
semantic search, one of which I am going to show during the workshop.
[1] : http://www.sourcesense.com
[2] : http://opensource.sourcesense.com
[3] : http://incubator.apache.org/uima
[4] : http://wiki.apache.org/incubator/October2009
10th participant: Semantic Technology LAB
Friday November 13, 2009: 12h30 – 12h50
Fri, 23 Oct 2009 From: Alfio Massimiliano Gliozzo <gliozzoat gmail dot com>
Dear all,
I will present an application of knowledge retrieval at the next IKS workshop, “Semantic
Search - Fact and Fiction”, in Rome. It is a semantic search engine called “Semantic
Scouting” that works on an RDF/OWL ontology describing the CNR organization,
developed collaboratively by almost all members of my lab as a showcase for the
capabilities we are currently developing here.
CNR is the largest research institution in Italy, employing more than 20k researchers,
organized into departments and institutes, subdivided into research units characterized by
different competences, research programs, and laboratories. We migrated the information
spread across different CNR databases into a common RDF/OWL knowledge base
containing both texts (e.g. the titles of the papers written by each researcher) and
structured data (e.g. relations between researchers and their institutes) [1]. The result is a
critical mass of data representing around 30k instances organized into 50 classes and
1.8M triples.
Furthermore, we expanded the knowledge base by performing some simple inference
(e.g. deriving the co-authorship relation) and we automatically generated relations to
linked open data resources, in particular DBpedia categories, by exploiting advanced text
processing techniques.
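A sketch of that kind of co-authorship inference as a SPARQL CONSTRUCT query,
using rdflib and an invented vocabulary (the property names are placeholders, not the
actual CNR ontology): two researchers who authored the same paper get a derived
ex:coAuthorWith link:

  from rdflib import Graph, Namespace

  EX = Namespace("http://example.org/cnr/")

  g = Graph()
  g.add((EX.paper1, EX.hasAuthor, EX.alice))
  g.add((EX.paper1, EX.hasAuthor, EX.bob))

  derived = g.query("""
      PREFIX ex: <http://example.org/cnr/>
      CONSTRUCT { ?a ex:coAuthorWith ?b }
      WHERE { ?p ex:hasAuthor ?a, ?b . FILTER(?a != ?b) }
  """)
  for triple in derived:
      g.add(triple)   # expand the knowledge base with the inferred triples
      print(triple)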
We then developed a knowledge retrieval engine whose output is entities of different
types and whose input is queries in either Italian or English. Using such entities as entry
points, we can further explore the ontology in two different ways: browsing the graph of
relations around each entity, or opening forms representing relevant attributes and
relations.
The result is a running system that I will show at the workshop and that will soon be
delivered as a service within the CNR intraweb.
[1] Alfio Gliozzo, Aldo Gangemi, Valentina Presutti, Elena Cardillo, Enrico Daga,
Alberto Salvati and Gianluca Troiani. "A Semantic Web Layer to Enhance Legacy
Systems". In Proceedings of the 6th International Semantic Web Conference, Busan,
Korea, 2007.
11th participant: Semantic MediaWiki
Friday November 13, 2009: 12h55 – 13h15
Hi all,
"At AIFB (Karlsruhe Institute of Technology) I work on storage, query processing, query
interface and ranking on integrated collections of structured (RDF) data and text (DB &
IR). I will demonstrate the search solutions we have developed. One is a semantic search
extension to SMW (http://semanticweb.org/wiki/Special:ATWSpecialSearch) that
computes completions and translations of keywords. This results in expressive structured
queries that can use to retrieve precise answers from semantic wiki. The other called the
Information Workbench (http://iwb.fluidops.com/) supports the lifecycle of “interacting
with data”, i.e. from data integration, to semantic search, data manipulation, presentation,
visualization up to data publishing”.
Cheers, Thanh.
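As a toy illustration of the keyword completion and translation idea described above (the
vocabulary and query shape are invented for this sketch, not the actual SMW extension):
keywords are completed against known schema terms and translated into one structured
query:

  SCHEMA = {"class": ["Person", "Project"], "property": ["member of", "located in"]}

  def complete(prefix):
      # Complete a keyword against known schema terms.
      return [(kind, term) for kind, terms in SCHEMA.items()
              for term in terms if term.lower().startswith(prefix.lower())]

  def translate(keywords):
      # Translate completed keywords into a structured (SPARQL-like) query.
      patterns = []
      for kind, term in keywords:
          if kind == "class":
              patterns.append(f"?x a :{term.replace(' ', '_')}")
          else:
              patterns.append(f"?x :{term.replace(' ', '_')} ?y")
      return "SELECT ?x WHERE { " + " . ".join(patterns) + " }"

  completed = complete("pers") + complete("member")
  print(translate(completed))   # SELECT ?x WHERE { ?x a :Person . ?x :member_of ?y }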