I’ve been thinking a lot recently about remnants in language — speech patterns and phrases we still use long after we have forgotten their original purpose. (You guessed it: my studies on the formation and change of human language are still going strong.)
We repeat the same phrases without any real understanding of their original meaning. You might describe a cruel individual as "ruthless", even though the word "ruth" has fallen out of common parlance. Somehow these things live on despite their original use having been lost. Like junk DNA … or the appendix.
This made me think about historical remnants that live on today within data management.
Remnants in data management? How did that happen?
Remember those heady days when physical tape was the de facto standard medium for data transfer? What about when we had to use obscure, condensed binary formats to make our data fit onto any available tape or file system?
The data was just too large to do anything else with it. All we ever did with that data was load it into software bought from a service company.
How did companies let this happen? Why was it okay to give away access to the data, to let it be controlled by service companies? If you see value in this data (and you certainly pay enough to acquire it), why not retain the ability to read it?
Parsing the problem
For the last few days I have been writing a parser for DLIS files — log files created by sensors used to monitor petroleum wells. Yes, seriously.
In the oil industry, we have a standard that we use for some really important — and really expensive — log data from wells. That standard is not very easy to read.
The DLIS format, stuck in 1991 as it is, is the digital embodiment of a hangover from a less technologically capable time. Technically, it is an open industry standard, officially owned by Energistics. The only documentation that seems to exist is the original RP66 V1 documentation (API Recommended Practice 66, Version 1), which is far from easy to understand. As far as I can see, an open-source parser for this format is not readily available. In essence, you have to pay someone in order to read these files.
So why am I writing a DLIS parser? Well, somebody has to do it! I'm trying to create a petrophysics data management solution based on new paradigms — open-source software, big data technologies, repeatable data pipelines — and I would argue that DLIS is an important format to include.
“We don’t have to resort to the old rulebook”
Throughout the project, I continually remind myself that we don’t have to resort to the old rulebook: to follow remnants of the past that make little sense any more.
On reflection, it's perfectly adequate to translate this data to an ASCII format and to make it self-describing. Yes, the files will be larger, but so too are hard drives these days. The benefits far outweigh the storage cost: open access, the ability to index and search the data, the ability to use existing tools to manipulate the data. The list goes on…
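To make the trade-off concrete, here is a minimal Python sketch of the idea. It is not a DLIS parser: the two-channel binary layout, the channel names and the units below are all made up for illustration, and real RP66 records are considerably more involved. It only shows how a terse packed record can be turned into a text document that carries its own description.

```python
import json
import struct

# Hypothetical packed layout (NOT the DLIS/RP66 layout): each record is a
# big-endian 32-bit float depth followed by a 32-bit float gamma-ray reading.
RECORD = struct.Struct(">ff")

# Assumed channel metadata; in the binary form this lives outside the data.
CHANNELS = [
    {"name": "DEPT", "unit": "m"},
    {"name": "GR", "unit": "gAPI"},
]


def binary_to_self_describing(raw: bytes) -> str:
    """Decode fixed-width binary records into JSON text that names its own channels and units."""
    if len(raw) % RECORD.size != 0:
        raise ValueError("input is not a whole number of records")
    rows = [list(values) for values in RECORD.iter_unpack(raw)]
    return json.dumps({"channels": CHANNELS, "rows": rows}, indent=2)


# Two sample records, packed the "old" way and then translated.
sample = RECORD.pack(1000.0, 75.5) + RECORD.pack(1000.5, 80.25)
text = binary_to_self_describing(sample)
print(text)
```

The text version is several times larger than the 16 bytes of packed input, but anyone with a text editor or a JSON library can read it, index it and search it without a proprietary reader.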
We need to think about why we do what we do, and whether it still makes sense in today's world. If our actions no longer make sense, we should point that out to other people and, most importantly, adapt.
Have you got relics in your data management? Shout about it by sharing this article — or even write your own (and drop me a link). You can also get in touch with me if you have similar data management challenges. I’d love to hear from you!
Oh, and if you happen to have source code for parsing DLIS files that you are willing to share with the world, I would very much like you to let me know.