Global Change Information System (GCIS)
The U.S. Global Change Research Program (http://www.globalchange.gov) sponsored the creation of a new information system, the Global Change Information System (GCIS), which provides a web-based source of authoritative, accessible, usable, and timely information about climate and global change for use by scientists, decision makers, and the public. Launched coincident with the 2014 National Climate Assessment (http://nca2014.globalchange.gov), it is initially focused on capturing and presenting all of the supporting information (datasets, papers, people, projects, etc.) from that report. The GCIS will eventually link together climate and global change information from across the federal government.
The GCIS API is available at http://data.globalchange.gov.
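The API serves each resource at a path under that domain; a minimal sketch of building request URLs, assuming the common pattern of appending a format extension (e.g. .json) to a resource path (the /report/nca3 path and extension convention here are illustrative; verify against the API documentation at http://data.globalchange.gov):

```python
from urllib.request import urlopen

BASE = "http://data.globalchange.gov"

def gcis_url(path: str, fmt: str = "json") -> str:
    """Build a URL for a GCIS resource in the requested representation."""
    return f"{BASE}{path}.{fmt}"

def fetch(path: str, fmt: str = "json") -> bytes:
    """Retrieve the resource body (requires network access)."""
    with urlopen(gcis_url(path, fmt)) as resp:
        return resp.read()

# e.g. gcis_url("/report/nca3") -> "http://data.globalchange.gov/report/nca3.json"
```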
This session will present an overview of the system's status and progress, along with some initial information modeling and web site concepts. There will be time for discussion and feedback about the long-term vision for the system.
Agenda:
Introduction to GCIS - Robert Wolfe – USGCRP GCIS Technical Lead
· U.S. Global Change Research Program
o Coordinates federal research, prioritizes and supports cutting-edge research, assesses the state of scientific knowledge, communicates research findings
· Played a major role in the National Climate Assessment – there have only been 3 since 1990, and the latest one was just released – includes the uncertainties in the findings, analyzes trends and projections into the future
o 3 reports: 2000, 2009, 2014
· GCIS – right now
o Have an initial prototype – supports the distribution, presentation, and documentation of the NCA; integrates into the USGCRP website
· Information Quality Act – reproducibility & transparency – very important to show how results were obtained
· Complete traceability for NCA content – have traceable sources for references
o Spectrum from transparency to reproducibility (if they have the right tools/computers)
· Define categories of information within the report
o Figures, images, data source
o Build a process for collecting source information – named sources, a web-based survey producing ISO 19115 metadata, IT infrastructure
· The website for the NCA has been praised as well done by the White House
o Have a structured data server webpage
o Can access the information via an API
o Working on documenting the data sources – about 20+ done
· Dataset metadata from a figure – can get all of the background/metadata about a figure – clicking on a figure shows the time range, who created it, which datasets were used, and the activity that derived the information
· GCIS structured data server – capture, identify, organize, present, maintain
o Identification is a key point – identifiers must be persistent and resolvable
· GCIS database/API – RESTful API with representations in JSON and in Turtle
· Have a very complete ontology that they are working with
o Have classes and properties for report and each chapter of the report
o Started with existing ontologies
o Used PROV-O classes and various other ontologies
· SPARQL example – http://data.globalchange.gov/examples
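A sketch of issuing such a query from Python follows; the /sparql endpoint path, the gcis.owl namespace, and the gcis:Figure/gcis:isFigureOf terms are assumptions for illustration – the official examples at http://data.globalchange.gov/examples should be taken as authoritative:

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://data.globalchange.gov/sparql"

# Illustrative query: list a few figures and the reports they appear in.
QUERY = """
PREFIX gcis: <http://data.globalchange.gov/gcis.owl#>
SELECT ?figure ?report WHERE {
  ?figure a gcis:Figure ;
          gcis:isFigureOf ?report .
} LIMIT 10
"""

def sparql_request(query: str) -> Request:
    """Build a GET request for the endpoint, asking for JSON results."""
    url = ENDPOINT + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})
```

Passing the request object to `urllib.request.urlopen` would execute the query against the live endpoint.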
· Two parallel paths – (1) NCA3 release and (2) populate GCIS
o Q – working on this for future documents – not planning on going back in time
o To get this information, you have to do it while you are generating the report
o Before GCIS it was very difficult to capture it
· The future
o In 2013 – started to support the NCA report
o Just started to put indicators in for the initiative
§ The demo is the first 24 indicators and the pilot is all the rest
o Use health assessment to make this easier
o More ontology development/improvements – some aspects that we can do better – learn from the pilot
o Proposal on the table is for a sustained assessment – don’t break up the team in a few years – keep building on what you have now
o Possibly supporting the EOA (Earth Observation Assessment)
· Q – What was the balance of positive vs. negative responses to the NCA report?
o News – publicity – over 2000 articles
o Pointed to specific locations on the website – one element was being able to select specific sectors or specific region – easy to see the impacts on what they want to know
o General feeling – more positive responses than in 2009
o There is an NCA team that is looking into this specifically
o Criticism came from the expected sources
o There is a system for review and comment on all files that are put out
· Q – curation and maintenance of GCIS
o One of the questions for the next report – should it be web-first? Usually it is PDF and then web; in 3 years, do a web document first and then produce a document to represent it
o Is it semantically enabled from the beginning
o Re-evaluated how things were done and looking to improve the process
· Q – before NCA3 there were author instructions and templates – expect a different approach to instructions/templates?
o There is a lot of evaluation of the process of generating the reports
o Main take home message – start this process early
· Q – What sort of feedback have you gotten from the authors? – Haven’t gotten that yet
The Global Change Information System: Nuts and Bolts – Brian Duggan, Steven Aulenbach, Robert Wolfe, Justin Goldstein
· NCA3 is about creating a PDF and a website
o 800+ page PDF with content that was traceable
o Website – another way of looking at the PDF
· How – lots of different tools – e.g. Google Docs, scientific software
· Resources – things that needed to fit into the report
· Role of GCIS
o Common points of reference – e.g. the bibliography has to refer to the same article the same way – there must be a structured way
o Vocabulary had to be consistent – language, terminology, vocabulary, ontology (Ex. Publication, bib entry, and a citation)
o Needed a uniform way of identifying resources – use URIs (URLs)
o Needed fine-grained tracking of all the changes made – not just version control of a document – looking at the database and recording who was asserting what was true about a relationship
· GCIS is backing the website for the NCA3 report… so a call on the website calls GCIS – back end is using PostgreSQL
· GCIDs are GCIS identifiers – the domain name followed by a resource path ending in an ORCID, a UUID, or another identifier
· Functionality – support report & website, minimal landing page for resources, JSON API, semantic information, interoperability (using existing identifiers), public SPARQL endpoint
o Interoperability by reusing identifiers
o Identifiers used outside of the GCIS can be queried to relate the endpoints
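A hypothetical helper illustrating the GCID pattern described above – a domain name followed by a typed resource path whose final segment may be an ORCID, a UUID, or another identifier (the exact layout and the sample ORCID below are assumptions for illustration):

```python
from urllib.parse import urlparse

def parse_gcid(gcid: str):
    """Split a GCID URL into (domain, resource_type, identifier)."""
    parts = urlparse(gcid)
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 2:
        raise ValueError(f"not a typed GCID: {gcid}")
    return parts.netloc, segments[0], segments[-1]

# e.g. a person resource identified by an ORCID-style id:
domain, rtype, ident = parse_gcid(
    "http://data.globalchange.gov/person/0000-0002-1825-0097"
)
```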
· Q – previously, what was the back end? PostgreSQL – but now it sounds like Virtuoso
o There were strengths and weaknesses to both of these
§ Need referential integrity
§ Fine grain auditing
§ Cascading deletes
§ Needed performance aspect & tools for providing web services
o Also had a very short time to put together a product that wouldn’t fall over – will do more as time progresses
· Q – Can you provide an example of a relationship that PostgreSQL couldn’t express?
o How is a figure in a report different from or the same as one provided elsewhere – not just a mapping of one figure to another, but including meaning – have a generic entry for a publication that corresponds to other relationships
o There are tables in the DB just to generate Turtle
o This is an information model that branches into the schema
o SPARQL is good for getting data out – but for putting in, managing, deleting… for one GCIS resource, a single insert can create a whole set of triples… it gets tedious in the long term
· Testing – enhancing the semantic model – runs a test for each GitHub commit and SPARQL query – also testing not just the code but the content (not just internal consistency, but continuous content validation)
· Clients – Python, Perl, JavaScript, PHP
· What is the difference between narrative vs. structured content – e.g. background on an image
· Semantic vs. relational – is this critical for the front end, or is it going to be ingested?
· Concept of resources – with a uniform identifier
· Identifiers – there is no good identifier for organizations – dataset identifiers were local to the NCA3 process & changing: if the title changed, then the identifier changed
· Have entities, agents, activities – publications (entities), contributors (agents), and then activities – a modified PROV model
· http://data.globalchange.gov/resources
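The entity/agent/activity scheme above can be sketched as a simple mapping onto PROV-O classes; the resource-type names and the figure/dataset/organization entries here are illustrative assumptions, not the official GCIS ontology:

```python
# Illustrative mapping of GCIS resource types onto PROV-O classes:
# publications are prov:Entity, contributors are prov:Agent, and the
# derivation steps (e.g. the processing that produced a figure from a
# dataset) are prov:Activity.
PROV_CLASS = {
    "publication": "prov:Entity",
    "figure": "prov:Entity",
    "dataset": "prov:Entity",
    "contributor": "prov:Agent",
    "organization": "prov:Agent",
    "activity": "prov:Activity",
}

def prov_class(resource_type: str) -> str:
    """Return the PROV-O class for a GCIS resource type."""
    return PROV_CLASS[resource_type]
```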
Discussion – Steve Aulenbach
· Q – Usage statistics/analysis?
o Have the ability to see who has viewed it – any time it is linked then it is recorded
o Have Google Analytics
· Q – backing database – is it distributed, and what is its capacity?
o No – it is a single instance of PostgreSQL
· Q – backup/preservation – where is it backed up offsite?
o A couple of places – UCAR – top enterprise level backup
o During development (Before web engineering group) – work with local server that had a great set-up
· Q – how much time did you spend extending the schemas in PostgreSQL
o Still are
o Had schema changes incorporated into the release process – includes a patch – good patch management – this allowed for extension of the relational model – will have to keep doing this
o Automated – so it isn’t a barrier
o All software is on GitHub – everything is open-source activity
o Had a deadline – but the people that set it (the White House) can move it as they wish – there are some interesting naming conventions, not always consistent – it is a real live prototype – made the release date and are now iterating as they clean up
§ Testing guarantees that it will be easy
· Q – mentioned testing – have you tested data/metadata against PO.DAAC – how?
o Not yet
o Would like to work to set that up – bring it into the open source process
o DOIs are important for what a data center can do
· Q – challenges that you faced to get the figures/maps
o People work in spreadsheets – this was really difficult for “us” – the report was written in Word. People are using EndNote for the bibliography – but every author submitted Word documents
o Hope to do things differently in the future
o What should be put into GCIS
o Kurt – references – who actually looks at each reference? The intent is for a person… here it needs to be computer readable – for this purpose they had to have high-quality references
o It is open source – thus it can be stood up for others to use – but can’t add to the GCIS database of references because it is still for a specific…
o References are all pulled directly from the publisher – trying to automate this
o Need to comply with the Information Quality Act – requires full traceability