This post collects some best practices and guidelines for sourcing and evaluating content for Watson-enabled applications. What follows is my personal view and recommendations (not endorsed) and should not be used as a reference for every set of requirements, as each client’s requirements will differ.
Content Sourcing Considerations
Building an application with Watson requires the right kind of data and content. Watson answers questions from the content that is ingested into the system. Depending on the use case, you would source content from within your enterprise, from external sources (for example, crawling the web for medical journals, financial reports and so on), or from a combination of both.
As you ingest content into Watson, you need to ensure that you own the content or that it is publicly available and free to use. Check the license requirements associated with the content.
Alternatively, you can tap into the Watson Content Marketplace to leverage different sources of content for your use case. For instance, if you are building a travel application, you can draw on the health care content in the Watson Content Marketplace, which provides details on vaccinations relevant to a user’s travel. The Watson Content Marketplace makes this possible by bringing together different sources of data for developers and content providers, including general knowledge, industry-specific content and subject matter expertise.
As part of the content sourcing strategy, you would also define a lifecycle strategy for managing the content: for instance, how often new content is added, how updates to existing content are handled, and how long content remains valid.
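One way to make the lifecycle strategy concrete is to keep a small manifest of source documents and flag those that are due for review. The sketch below is only an illustration: the `ContentItem` record, its fields and the `items_due_for_review` helper are hypothetical names, not part of any Watson tooling.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ContentItem:
    """One source document tracked by a (hypothetical) content manifest."""
    path: str
    last_updated: date
    review_every_days: int  # how often the content owner wants to re-check it


def items_due_for_review(items, today):
    """Return the items whose review interval has elapsed as of `today`."""
    due = []
    for item in items:
        next_review = item.last_updated + timedelta(days=item.review_every_days)
        if next_review <= today:
            due.append(item)
    return due
```

Running a check like this on a schedule gives the content owner a list of documents to update, re-validate or retire before they are uploaded again.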
Watson currently supports HTML, PDF and Word documents as part of its ingestion process. Any other format needs to be converted into one of these. Watson currently doesn’t crawl content; content needs to be managed and uploaded into Watson through its user interface (referred to as Watson Experience Manager). These statements are true at the time of writing.
Content validation is a critical piece of every Watson engagement. To start with, content needs to be in one of the formats supported by Watson.
Here are the general guidelines around content validation –
- Content should be in English (support for other languages is being added).
- Content should be in one of the supported formats: Microsoft Word formats (except 2003), HTML and PDF.
- The content should be UTF-8 encoded.
- If the content contains scanned images or scanned text (OCR), it needs to be converted into one of the supported formats.
- The content should not be password protected.
- Content should not contain any personally identifiable information. Any personally identifiable information needs to be handled by the application outside of Watson.
- Images are currently not processed by Watson. If images convey important information that users will ask about, such as flowcharts or process flows, identify a substitute for them. Image-centric content, such as a document made up only of financial charts, would not work with Watson. You can use the Watson Image detection service to get details of the images.
- If the content contains tables, ensure each table has well-defined headings. For instance, an HTML table would typically have header row and column tags. On the other hand, tables built from div elements and CSS styling might not work as expected, as the metadata needed to understand the context of the table might not be available.
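Some of the checks above can be automated before content is uploaded. The following is a minimal sketch using only the Python standard library. It covers the format, UTF-8 and table-heading checks for HTML files; it does not detect password protection, scanned pages or personally identifiable information. The extension list and the `validate_content` function are my own assumptions for illustration, not a Watson API.

```python
from html.parser import HTMLParser
from pathlib import Path

# Assumption: file extensions standing in for "Word (except 2003), HTML, PDF".
SUPPORTED_EXTENSIONS = {".html", ".htm", ".pdf", ".docx"}


class _TableHeaderScanner(HTMLParser):
    """Counts <table> elements and those containing at least one <th>."""

    def __init__(self):
        super().__init__()
        self.tables = 0
        self.tables_with_header = 0
        self._stack = []  # one bool per open table: has a <th> been seen?

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables += 1
            self._stack.append(False)
        elif tag == "th" and self._stack and not self._stack[-1]:
            self._stack[-1] = True
            self.tables_with_header += 1

    def handle_endtag(self, tag):
        if tag == "table" and self._stack:
            self._stack.pop()


def validate_content(filename, raw_bytes):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or 'no extension'}")
        return problems
    if ext in {".html", ".htm"}:
        try:
            text = raw_bytes.decode("utf-8")
        except UnicodeDecodeError:
            problems.append("not valid UTF-8")
            return problems
        scanner = _TableHeaderScanner()
        scanner.feed(text)
        scanner.close()
        missing = scanner.tables - scanner.tables_with_header
        if missing:
            problems.append(f"{missing} table(s) without <th> header cells")
    return problems
```

A table laid out with div elements and CSS never produces a `<table>` tag, so this scanner cannot see it at all; such layouts still need a manual review.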
Watson’s strength lies in unstructured data analysis. For instance, you can feed it a medical journal or a blog and it will understand the facts, relationships and sentence meaning contained in the documents. On the other hand, if your content consists only of financial data and you want to perform mathematical computation on it, the use case might not be a good fit for some of the Watson products, like Watson Engagement Advisor.
Content Structuring Practices
Content can be ingested into Watson “as-is” without any modification, but content such as HTML typically contains many noise elements, like navigation headers and sidebars of links, which interfere with the actual body of the content and would end up in the answer presented to the user.
Watson doesn’t modify the content, so it’s better to cleanse it and strip out such elements, which don’t add any value to the content.
In some cases, based on the outcome of the content evaluation, some content needs to be modified or structured to help Watson understand it with less training time. To use a book analogy: a book with an index, chapters and sections is easier to read, understand and interpret, and quicker to draw information from, than a book with only pages of text.
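As an illustration of this kind of cleansing, here is a minimal sketch that re-emits an HTML page while dropping common noise elements and everything inside them. It uses only Python’s standard `html.parser`. The set of tags treated as noise is an assumption; real sites often need site-specific rules, such as matching on class or id attributes.

```python
from html.parser import HTMLParser

# Assumption: these tags mark "noise" on the pages being cleaned.
NOISE_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class NoiseStripper(HTMLParser):
    """Re-emits HTML while dropping NOISE_TAGS elements and their contents."""

    def __init__(self):
        # Keep entities such as &amp; untouched so they can be re-emitted.
        super().__init__(convert_charrefs=False)
        self.out = []
        self._skip_depth = 0  # greater than 0 while inside a noise element

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._skip_depth += 1
        elif self._skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> have no separate end tag.
        if self._skip_depth == 0 and tag not in NOISE_TAGS:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif self._skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self._skip_depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self._skip_depth == 0:
            self.out.append(f"&#{name};")


def strip_noise(html):
    """Return `html` with noise elements removed."""
    parser = NoiseStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Comments and the doctype are dropped as a side effect, and an unclosed noise tag would swallow the rest of the page, so the cleaned output is worth spot-checking before upload.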
Here are some of the recommended practices on structuring content –
- The document should contain well-defined sections, each with a section title. For example, in HTML the section titles are marked with the standard h1, h2, h3 and so on tags, and the content within each section is treated as the body of that section. In Word documents, section titles are identified through style formatting (h1, h2 and so on), and in PDF they are characterized by font size or font style.
- Organize and structure content into logical sections, preserving the hierarchy of the content. Think of the book example given earlier, with well-defined chapters and sections.
- Remove noise from the content, such as navigation headers, sidebars and other elements that don’t add value. If you are crawling external websites, remove headers, footers, navigation links and menus so that only the actual body of content is included. Cleansing the content also ensures that users get the relevant information without the extraneous noise.
- Identify substitutes for images (Watson doesn’t currently process images) if they convey important information that users will ask about, such as flowcharts or process flows.
- Identify how structured data, i.e. data from tables (e.g. HTML tables), would be used in the context of the use case.
- If the content source contains PDF documents, try to obtain the original source document (if available), as important structural and hierarchy information is lost during conversion.
- Invest in content (source, organize, structure, cleanse, update) as end users get responses to questions from your content sources.
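To check whether an HTML document has the kind of section hierarchy recommended above, you can extract its heading outline and look for jumps in level. This is a minimal sketch using Python’s standard `html.parser`. The `extract_outline` and `hierarchy_gaps` helpers are hypothetical names for illustration, and "a heading that skips a level" is my own rough proxy for a broken hierarchy.

```python
from html.parser import HTMLParser

HEADING_LEVELS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}


class OutlineParser(HTMLParser):
    """Collects (level, title) pairs for every heading, in document order."""

    def __init__(self):
        super().__init__()
        self.outline = []
        self._level = None  # level of the heading currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_LEVELS:
            self._level = HEADING_LEVELS[tag]
            self._text = []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_LEVELS and self._level is not None:
            # Collapse runs of whitespace inside the title.
            title = " ".join("".join(self._text).split())
            self.outline.append((self._level, title))
            self._level = None


def extract_outline(html):
    """Return the document's headings as a list of (level, title) pairs."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.outline


def hierarchy_gaps(outline):
    """Return headings that sit more than one level below their predecessor."""
    gaps = []
    previous = 0
    for level, title in outline:
        if level > previous + 1:
            gaps.append((level, title))
        previous = level
    return gaps
```

An empty outline suggests a document with no marked-up section titles, and any entries returned by `hierarchy_gaps` point to places where a section is nested deeper than its parent heading allows.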