Veronika M. Megler


Summary

I am currently (late 2014) a post-doctoral fellow in Computer Science at the Maseeh College of Engineering & Computer Science, Portland State University, in Portland, Oregon.

Since Fall 2009, I have worked with Dr. Dave Maier at Portland State University, and with the Center for Coastal Margin Observation and Prediction (CMOP), part of OHSU.

I received my PhD in June, 2014. My dissertation topic is "Ranked Similarity Search of Scientific Datasets: An Information Retrieval Approach". My thesis committee consisted of:

My research interests include adoption and applications of emerging technologies; big data, information management, data access and discovery; spatio-temporal databases; integration of biological and physical data; and scientific data.

I am currently working on two projects:

My email at PSU is: vmegler at the web server address of cs dot pdx dot edu. (Link to my scholarly publication list)

"Data Near Here"

"Data Near Here" applies concepts from the field of Information Retrieval and Internet search to massive archives of scientific datasets. I address the following problem: with the explosion of data collected by scientists and stored in many files, many formats, many naming conventions, how do scientists find data that matches their research needs?

I use a running example of a scientist searching for salinity observations collected in May 2009, near the Astoria-Megler bridge. (Evidence of it running over CMOP's archive is in the screenshot below.) Note that in this case, there are no exact matches for the scientist's search terms as formulated; given no exact matches, the tool presents an ordered list with the "nearest" matches at the top.

High-level Dataset Search Architecture

Similar in concept to the way an Internet text search engine operates, I focus on providing a set of results ranked by similarity to a scientist's search; however, rather than text webpages, my users are searching for scientific (primarily numeric) data. I assume that after reviewing the search results, the scientist will wish to download, visualize or otherwise process selected datasets using other tools. Thus, the search engine is complementary to existing analysis and visualization technologies.

How it Works:

A set of crawlers scans an archive of datasets asynchronously. We create a brief summary of the contents of each dataset and store these summaries in a metadata catalog using a simple, consistent abstraction. The current prototype handles several different file types, and the scanning process can be easily extended to handle additional file types and formats.
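As a rough illustration of the summarizing step (not the prototype's actual code), the sketch below reduces a dataset to its time span, bounding box, and per-variable value ranges. The `DatasetSummary` fields, the `summarize` function, the record layout, and the file path are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class DatasetSummary:
    """Hypothetical catalog entry: one compact record per dataset."""
    path: str
    start: datetime
    end: datetime
    lat_range: tuple   # (min latitude, max latitude)
    lon_range: tuple   # (min longitude, max longitude)
    variables: dict    # variable name -> (min value, max value)


# Keys assumed to carry time and position rather than observed variables.
RESERVED = ("time", "lat", "lon")


def summarize(path, rows):
    """Summarize an iterable of observation dicts (assumed record layout)."""
    rows = list(rows)
    if not rows:
        raise ValueError("cannot summarize an empty dataset")
    times = [r["time"] for r in rows]
    lats = [r["lat"] for r in rows]
    lons = [r["lon"] for r in rows]
    variables = {}
    for r in rows:
        for name, value in r.items():
            if name in RESERVED:
                continue
            # Widen the running (min, max) for this variable.
            lo, hi = variables.get(name, (value, value))
            variables[name] = (min(lo, value), max(hi, value))
    return DatasetSummary(
        path=path,
        start=min(times),
        end=max(times),
        lat_range=(min(lats), max(lats)),
        lon_range=(min(lons), max(lons)),
        variables=variables,
    )


# Made-up observations, for illustration only.
observations = [
    {"time": datetime(2009, 5, 1), "lat": 46.2, "lon": -123.9, "salinity": 12.0},
    {"time": datetime(2009, 5, 3), "lat": 46.3, "lon": -123.8, "salinity": 18.5},
]
summary = summarize("hypothetical/cruise_2009_05.csv", observations)
print(summary)
```

In this sketch, supporting a new file type would only require a reader that yields records in the same layout; the summary itself stays format-independent.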

The user enters search criteria into a UI. (Note: "I am not a UI designer, and this is not the topic of my research.") A search engine searches over the metadata and returns ranked search results of the "closest matches" to the query, in real-time. Searches can include location, time, variable names of interest, or desired ranges for the data values. The results are displayed in a list (and, if geolocation information is available, on a map), along with brief summary information. The results can be downloaded for analysis or plotted in linked data analysis or visualization tools. A link leads to a page that shows the full metadata available for that dataset, thus providing the scientist with additional information upon which to make analysis decisions, if desired.
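To make the idea of "closest matches" concrete, here is a minimal sketch of ranking catalog entries against a query by location, time, and variable name. The scoring formula is a simplified stand-in invented for this example, not the similarity measure defined in the publications; the catalog entries, scale parameters, and approximate bridge coordinates are likewise assumptions.

```python
from datetime import date
from math import hypot


def interval_gap(lo, hi, q_lo, q_hi):
    """Distance between [lo, hi] and [q_lo, q_hi]; zero if they overlap."""
    if hi < q_lo:
        return q_lo - hi
    if lo > q_hi:
        return lo - q_hi
    return 0.0


def score(summary, query):
    """Similarity in (0, 1]; 1.0 means every search term is satisfied."""
    # Spatial penalty: gap between the dataset's bounding box and the query point.
    dlat = interval_gap(*summary["lat"], query["lat"], query["lat"])
    dlon = interval_gap(*summary["lon"], query["lon"], query["lon"])
    space = hypot(dlat, dlon) / query["space_scale_deg"]
    # Temporal penalty: gap in days between the dataset's span and the query span.
    s_start, s_end = (d.toordinal() for d in summary["time"])
    q_start, q_end = (d.toordinal() for d in query["time"])
    time = interval_gap(s_start, s_end, q_start, q_end) / query["time_scale_days"]
    # Fixed penalty when the requested variable is absent (an assumption).
    missing = 0.0 if query["variable"] in summary["variables"] else 1.0
    return 1.0 / (1.0 + space + time + missing)


def rank(catalog, query):
    """Return (score, summary) pairs, best match first."""
    scored = [(score(s, query), s) for s in catalog]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Made-up catalog entries, for illustration only.
catalog = [
    {"name": "offshore_may", "lat": (45.0, 45.2), "lon": (-125.0, -124.8),
     "time": (date(2009, 5, 1), date(2009, 5, 31)), "variables": {"salinity"}},
    {"name": "bridge_may_temp", "lat": (46.18, 46.25), "lon": (-123.90, -123.80),
     "time": (date(2009, 5, 1), date(2009, 5, 31)), "variables": {"temperature"}},
    {"name": "bridge_june", "lat": (46.18, 46.25), "lon": (-123.90, -123.80),
     "time": (date(2009, 6, 1), date(2009, 6, 30)), "variables": {"salinity"}},
]

# Salinity in May 2009 near the Astoria-Megler bridge (approximate coordinates).
query = {"lat": 46.22, "lon": -123.86, "space_scale_deg": 0.5,
         "time": (date(2009, 5, 1), date(2009, 5, 31)), "time_scale_days": 30,
         "variable": "salinity"}

ranked = rank(catalog, query)
for similarity, entry in ranked:
    print(f"{similarity:.3f}  {entry['name']}")
```

None of the three made-up entries satisfies every search term, so, as in the running example, the result is an ordered list of near matches rather than an empty result set.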

Publications

"Data Near Here" is described in the following publications:

  • The best high-level overview of the work is: Megler, V.M. and Maier, D.: Data Near Here: Bringing Relevant Data Closer to Scientists. In: Computing in Science and Engineering, March 2013, Volume 15, No. 3. A preprint version can be downloaded here by permission.
  • Two user studies that show the effectiveness of the approach are described in Are Datasets Like Documents?: Evaluating Similarity-Based Ranked Search Over Scientific Data. In: TKDE: Transactions on Knowledge and Data Engineering, Volume 27, Issue 1, pp 32-45. January 1, 2015. An author's copy can be downloaded here by permission, and the referenced appendices are here.
  • My most recent publication is "Demonstrating Data Near Here", to appear in SIGMOD 2015.
  • Scalability and performance challenges, based on the current prototype implementation, are discussed here: Maier, D., Megler, V.M., Tufte, K.: Challenges for Dataset Search (keynote). Presented at the International Conference on Database Systems for Advanced Applications, Bali, Indonesia, April 2014. Lecture Notes in Computer Science, Volume 8421, 2014, pp. 1-15. It can also be downloaded here (by permission).
  • The issues that result from "variable diversity" - the many things a single variable may be named - are described in: Megler, V.M. (Supervised by David Maier): Taming the Metadata Mess. In: ICDE Brisbane Workshops (PhD Symposium), April, 2013. It can also be downloaded here (by permission). The poster (which has additional diagrams/material not in the paper) is here.
  • An early version of the underlying model is described in: Megler, V.M. and Maier, D.: When Big Data Leads to Lost Data. In: PIKM '12 Proceedings of the 5th Ph.D. workshop on Information and knowledge, 2012. Winner of PIKM "Best Paper" award. (The updated model is described in my dissertation.)
  • How it fits in the context of the archive it was originally built for is described in: Maier, D., Megler, V.M., Baptista, A., Jaramillo, A., Seaton, C., Turner, P.: Navigating Oceans of Data. In: Ailamaki, A. and Bowers, S. (eds.) Scientific and Statistical Database Management. Lecture Notes in Computer Science, Volume 7338, pp. 1-19. Springer Berlin / Heidelberg (2012). The original publication is available at SpringerLink, or can be downloaded here (by permission).
  • Megler, V.M., Maier, D.: Finding Haystacks with Needles: Ranked Search for Data Using Geospatial and Temporal Characteristics. In: Bayard Cushing, J., French, J., and Bowers, S. (eds.) Lecture Notes in Computer Science, Volume 6809, pp. 55-72. Springer Berlin / Heidelberg (2011). The original publication is available at SpringerLink, or can be downloaded here (by permission). Selectivity: 34% acceptance rate for long papers.
  • "Data Near Here: Ranked Geospatial-Temporal Search of Scientific Data", V.M. Megler and David Maier, presented at Symposium on Space-Time Integration in Geography and GIScience, AAG 2011, April 2011
  • "Data Near Here: Ranked Geospatial-Temporal Search of Scientific Data (Take 1)", Veronika Megler and David Maier, GIS In Action 2011, March 2011

... and, of course, at great length in my dissertation (of which sections, surprisingly, strongly resemble the above listed papers. But there is new content there, too). [local copy]

A patent, "A Search Tool that Utilizes Numerical Scientific Metadata Matched Against User-Entered Parameters", United States Patent US8560531 B2, was issued on October 15, 2013 (filed July 1, 2011). Inventors: Veronika Megler, David Maier; Joint IBM/Portland State University.

Current 'Data Near Here' Activities

Data Near Here is in production at CMOP, for use by registered users only. It will be opened to outside users in the (hopefully near) future (reasons for the delay can be found here). The CMOP production implementation currently focuses primarily on CMOP's own data archive; data from other archives may be searchable via this implementation in the future.

A research prototype is available (well, it's available when it's working) here. I wish to preserve my freedom of action (i.e., to break it again). The interactive portion is written in PHP, JavaScript, and jQuery, accessing a PostgreSQL database. The crawlers are in Python. Technologies were chosen for ease of prototyping and to fit in with CMOP's standards, and may not, in fact, be the best choices for this kind of application.

We are now exploring application of the same ideas, concepts and (hopefully) code-base to genomics data.

"Portland Observatory"

This is a new research project, intended to explore how one might architect and build an observatory that understands and adapts to the wide variety of data gathered or otherwise available in a single domain. The project uses our local city of Portland, Oregon, as a laboratory and example within which to explore these concepts.

One use case is described in Guiding Data-Driven Transportation Decisions. To appear at BDUIC 2014, the Big Data and Urban Informatics Workshop, UIC, Chicago, IL, August 11-12, 2014.

Other Projects

I am involved in other "random" (fun) research-related activities. They include:

Past Lives

My last industry position was Executive IT Architect at IBM, prior to leaving to pursue my PhD. Here are links for: a) a summary of recent accomplishments and b) a list of my industry publications.

My 15 minutes of fame (well, it's probably 30 minutes by now) are documented in many places, including: Tolkien Gateway (which references a parody of our game I didn't know about), World of Spectrum (with a picture I have no memory of), and a Wikipedia entry (none of which I wrote). There's even a walkthrough, here.

As of "the day I last checked", it was the second most downloaded item in the Internet Archive's Historical Software Collection, where you can 'play it again' in an emulator. And someone spent far longer than it took Phil and me to write it on reverse-engineering it (from the bytecode, no less) and exposing the innards (Wilderland).

I hereby apologize for making Thorin spend so much of his time singing about gold...

Recent (extensive) media coverage includes:

... and I even wrote a recent blog article for the Australian Centre for the Moving Image, Ruminations on The Hobbit Fandom. Here is their profile of me. There's also a paper I presented at the 'Born Digital and Cultural Heritage Conference', ACMI, Melbourne, Australia, in June 2014.

In Memoriam: Dr. Vendelin R. Megler, 1921-2011



Last updated, January 2015.