Severe thunderstorms in New Mexico and Texas on 23 March 2007, viewed as convection via two satellites (MODIS and GOES). IR images on top; water vapor channels on bottom.
The reason for being a librarian, or a data curator, is the library (or data center), right? Well, that may be simplistic. The reason is more properly expressed as facilitating the process of humans finding and using information. But the information must exist to be found and used, and the information collections are what we information professionals love.
I was drawn to the MLIS degree initially by a pure love of libraries. I love being in them; I love looking at pictures of them and reading about them. My favorite library activity is to play "candy box". I visit a library, preferably a university library with lots of journals (such as SFSU's J. Paul Leonard Library), and I pick an item at random–say, a journal with a title that is not immediately understandable to me, or is not necessarily one of my known interests–and start reading it until I run into a question. That question leads to research, which leads anywhere in the library. Usually I only survive three or four "jumps" before I am engrossed in some book I never would have found or thought of looking for before.
Once I determined I was going to pursue an MLIS, though, I started remembering aspects of the data center where I did software testing: NASA's Goddard Earth Sciences Data and Information Services Center (GES DISC) (formerly Distributed Active Archive Center, or DAAC). As part of the DAAC, thanks to its director at that time, Paul Chan, we didn't just keep our heads down and do our own jobs. I learned about the entire mission of the DAAC and all its parts in periodic lectures, informal weekly seminars from my fellow workers explaining their jobs, and casual interactions with my co-worker friends.
Faults on Earth (San Andreas, CA) and Europa. Comparison with geological knowledge about Earth helps us understand extraterrestrial geology.
I learned about the audacity of deciding to create, download, catalog and serve a terabyte of information a day. (A terabyte! Almost unimaginable in 1998. Laughable, perhaps, in today's age of big data.) I learned about tape drives for storage and limited media lifetimes. I learned about data levels (data processed to varying levels of sophistication) and metadata. I learned why you should take pictures of the Earth and analyze them, and I learned why Earth science is just as interesting and necessary as space exploration.
So in addition to being a "library-phile", I began saying, as early as my first iSchool class introduction discussion post, that what I wanted to do with my degree was archive and retrieve earth science datasets. Over the years this goal was honed to "become a data curator for geoscience datasets." And what it means is being a librarian for really cool remote sensing information–helping to file it in such a way that it can be found, and preserve it so that it is available, and find just the right section of it for each client that comes along.
Collections are nothing without information users (discounting the possible value of saving a collection for its own sake, as in A Canticle for Leibowitz [Miller, 1972]), but collections are wonderful in their own right. I hope to participate in creating and maintaining collections the rest of my life.
I have worked with several digital collections during my career: I have built and maintained multiple collections of test cases, and I have performed software testing in organizations with more traditional data collections of remote sensing data, web pages, and science experiment data.
The first data collection I worked on, at the Goddard DAAC, was a theory. It was the MODerate Resolution Imaging Spectroradiometer (MODIS) collection, which didn't exist yet, as the first MODIS-bearing satellite, Terra, had not yet launched. I was testing the integration of the science software with the custom operating system; together these would process incoming data, creating multiple levels of processing for varying scientific needs. Some fifteen years later, I have accessed images from this collection for various projects for my MLIS, and I am proud to have played a part in preparing the DAAC for this collection. MODIS images pop up in media from time to time, such as this tweet from the Weather Channel, which is gratifying.
At NASA Ames Research Center, working on the Center TRACON Automation System (CTAS), I worked on a different sort of collection, one I've built in multiple software testing jobs: a collection of test cases. CTAS "provides automation tools for planning and controlling arrival air traffic" (De Los Santos, 2014). At CTAS, we used actual recordings that we made every night of live flight information to look for bugs. One classic bug occurred when an airplane was told to loop around the airport, due to a missed approach, before trying to land again. Because the plane had been in the landing pattern, the CTAS software had already planned out its landing route, and when the plane broke out of the landing path to fly around the airport, the CTAS software did not understand and crashed. (The software in the test center crashed; it had not yet been deployed to air traffic controllers, and no planes crashed.) It was an excellent real-life example of a condition that had not been anticipated by the software designers. Test cases like this are critical to software testing. They can be standard, anticipated use scenarios, or they can be edge conditions such as a plane having to loop around, or they can be stress conditions (which we tested at CTAS with the 4 a.m. influx of FedEx flights in and out of Memphis ["The FedEx Superhub", 2013]). Having a collection of input data and user actions that trigger known conditions is required for regression testing, which is testing performed on software after a change to be sure that what used to work still works and no new bugs have been introduced. While I had a collection of testing procedures for the software at the DAAC, this was the first collection of test cases that I worked with.
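To make the idea of a regression-test collection concrete, here is a minimal sketch in Python. It is not CTAS code: the `Scenario` class, the event names, the `plan_landing` stand-in for the software under test, and the `run_regression` helper are all hypothetical, invented only for illustration. The sketch assumes each test case is a recorded sequence of input events paired with the outcome the software produced when it was known to work, and that a crash counts as a failure rather than stopping the run.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Scenario:
    """One item in the test case collection (hypothetical structure)."""
    name: str
    events: Tuple[str, ...]  # recorded inputs, in the order they occurred
    expected: str            # outcome observed when the software last worked


def plan_landing(events: Tuple[str, ...]) -> str:
    """Hypothetical stand-in for the software under test."""
    state = "en_route"
    for event in events:
        if event == "enter_pattern":
            state = "landing_planned"
        elif event == "missed_approach":
            # Edge condition: abandon the planned route and go around.
            state = "go_around"
        elif event == "touchdown":
            state = "landed"
    return state


def run_regression(
    cases: List[Scenario], system: Callable[[Tuple[str, ...]], str]
) -> List[Tuple[str, str, str]]:
    """Replay every case; return (name, expected, actual) for each failure."""
    failures = []
    for case in cases:
        try:
            actual = system(case.events)
        except Exception as exc:  # a crash is recorded as a failed case
            actual = f"crashed: {exc!r}"
        if actual != case.expected:
            failures.append((case.name, case.expected, actual))
    return failures


# The collection: a standard scenario plus two edge conditions.
TEST_CASES = [
    Scenario("standard arrival", ("enter_pattern", "touchdown"), "landed"),
    Scenario("missed approach", ("enter_pattern", "missed_approach"), "go_around"),
    Scenario(
        "missed approach then landing",
        ("enter_pattern", "missed_approach", "enter_pattern", "touchdown"),
        "landed",
    ),
]

if __name__ == "__main__":
    for name, expected, actual in run_regression(TEST_CASES, plan_landing):
        print(f"FAIL {name}: expected {expected}, got {actual}")
```

Rerunning the same collection after every software change is what makes it a regression suite: a case that used to pass and now fails, or crashes, points to a newly introduced bug.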
Another data collection that I worked with, as a software tester, was the collection of bookmarks and saved pages that users created at Furl, a social bookmarking site. Again, certain types of web pages were used in our test case collection; however, the individual collections of web pages were also something I worked with. Users had the option to share the saved items with individuals, groups, or the public, which created different types of collections. All had to be tested for integrity (was the page actually saved and can it be retrieved in a readable format?), legitimacy (spam was a major problem due to the public option), usability (could our information be read on multiple OS-browser platforms?), and privacy (was a private save truly private?), among other things. Sadly, LookSmart sold Furl to Diigo, which ignored the millions of saved web pages (deleted them, as far as I know) and provided less functionality for the bookmark collection. I personally moved to pinboard.in where I have greater functionality, though still no way to save snapshots of web pages, which was Furl's greatest feature–it provided a way to create personal web archives that survived the ravages of link rot.
A collection I am working with in my current job is the flow cytometry experiments that have been uploaded to Cytobank. There are similar requirements for test case collections, as outlined above, and for testing the maintenance of the data itself, as described in my Furl job. In addition to the "raw" flow cytometry data and metadata uploaded by the scientists, many create more data by using Cytobank's visual analysis tools to explore their data. All of that data has to be protected for privacy, maintained, backed up, and readily available day or night to scientists around the world. We have run into issues with server speed and internet access in other countries such as India and Pakistan (including outright government-mandated power outages!), and our software has to make the data available on multiple OS-browser combinations, including old ones. (Thankfully, we officially no longer support Internet Explorer 6 or 7–both are very difficult to work with for modern web pages that use JavaScript.)
Cytobank experiment data and analyses can be made public, which is important for the science of flow cytometry and the scientists and doctors who use it. Readers of journal articles are able to go to a link on the Cytobank site and see the data for themselves; they are often allowed, by the experiment owners, to do temporary analyses themselves so that they can judge the veracity of the data and the published scientific conclusions. As a tester, I help ensure that this scientific data collection is safe and accessible.
Paper: "Geospatial Data Quality Metadata Standards"
Also available on ESIPFed Information Quality wiki
In "Seminar in Archives and Records Management: Electronic Records" (LIBR284), I wrote a paper examining the tangle of data quality standards. Data quality is important to evaluate so that users know whether they can use a dataset. This evaluation can be for the whole dataset or down to the pixel or bit level. In order to evaluate data quality, data curators have to understand the data quality standards, which are evolving and complex, so that they can provide data quality metadata that has meaning outside the archive organization. I am a student member of the Federation of Earth Science Information Partners (ESIPFed), so I added this paper to the ESIPFed wiki page for IQ standards, which is part of a resource web page for the ESIPFed Information Quality cluster (workgroup).
For "Resources and Information Services in the Disciplines and Professions: Maps and Geographic Information Systems" (LIBR220), I did a joint assessment of the Federal Depository Library Program's (FDLP) map resources in the San Francisco Public Library (SFPL) main branch. My assignment partner did an assessment of the online Yale map resources from the FDLP. Both assessments, while short, were disappointing; access was not easy. In fact, physical access to the maps in SFPL was not disability-friendly, aside from the difficulty of finding and obtaining the maps to view. This was an experience in examining a collection that is fantastic in concept and disappointing in execution, at least insofar as allowing and providing access to information seekers.
Also for the "Seminar in Archives and Records Management: Electronic Records" (LIBR284) class, I wrote an analysis of a research project that provided guidelines for selection and preservation of geospatial electronic records. The guide also discussed legal issues involved with these types of records. This is the type of guide that I will likely use if I become a data curator for geoscience datasets.
Paper: "The Working Life of a Life Science Data Archivist"
An interesting assignment in "Resources and Information Services in the Disciplines and Professions: Science and Technology" (LIBR220) resulted in a chance to interview a former iSchool student, with whom I had worked in a group in "Beginning Cataloging and Classification" (LIBR248), who was now working at NASA Ames Research Center. I gained insight into the work of an archivist as opposed to a librarian, and how private archives view access (dimly). I learned about the nerve-wracking aspects of working with physical objects that cannot be backed up in multiple locations and are vulnerable to destruction.
Discussion Posts:
Two discussion posts from "Reference and Information Services" (LIBR210) are relevant to this competency. In the first one, I report on having interviewed a reference librarian about, among other things, the necessity of weeding and limited access policies for reference books.
The second post is a philosophical "thinking out loud" essay about access to collections and what I learned about that topic during the class. I ended up with more questions than answers, but they are the right questions to be thinking about, concerning personal bias and the ideal of providing equal access to information.
Collections must be maintained in order to remain useful. Metadata has to be checked for quality and kept up-to-date; data themselves have to be preserved, backed up, and transferred before media lifetimes run out. Sometimes items in a collection need to be weeded out because of obsolescence, poor quality, or budgetary constraints. (The latter should be avoided whenever possible.)
My passion for maintaining collections reaches back to the first database I built and managed, a student information database for JHU-Hopkins-Nanjing (see "What Work Prepared Me to Understand and Perform This Competency?", Competency 5), runs through the scientific datasets stored at my current job, and will continue into what I hope will be the next phase of my career, caring for geoscience/remote-sensing datasets. I am well-equipped to curate such datasets with the skills I have learned in my various jobs and in iSchool.
070323 MODIS GOES IR WV. (2007, March 23). Retrieved from http://cimss.ssec.wisc.edu/goes/blog/archives/date/2007/03/page/2
De Los Santos, V. (2014, May 8). Overview of CTAS. Retrieved from http://www.aviationsystemsdivision.arc.nasa.gov/research/foundations/index.shtml
The FedEx superhub. (2013, April 25). Retrieved from http://www.fedexlegends.info/superhub/hubhistory.html
Miller, W. M., Jr. (1972). A Canticle for Leibowitz. New York, NY: Bantam Books.
Last updated: Friday, April 17, 2015