Dataverse: Spike - Support for Stata versions 14 and 15

Created on 10 Jan 2018 · 3 Comments · Source: IQSS/dataverse

In Backlog Grooming on 1/10, we tried to estimate #2301 but weren't able to do so. The suggestion was to create a 1-point Spike with the goal of being able to estimate support for the following (together as one issue or as separate issues):

  • Stata 13 (may work in certain cases)
  • Stata 14
  • Stata 15

Most helpful comment

We currently maintain 2 Stata ingest plugins: one for Stata 13 (Stata's internal format "dta 117") and one for the older versions. Their v.13 format was re-engineered completely from scratch. It's very different from the older formats, so it warranted a new and separately maintained piece of ingest code.

I have reviewed the format documentation quickly. The good news is that the newer formats appear to be merely extensions of the Stata 13 format, not new developments. So we don't seem to need a new ingest plugin; rather, we should be able to simply teach the current "new Stata" ingest to understand the latest flavors of the format.

There have been 2 format variations since v.13:

  • Stata 14 ("dta 118")
  • Stata 15 ("dta 119")
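
To make the version mapping concrete, here is a minimal sketch (in Python, not the actual Dataverse ingest code) of how a reader could tell these flavors apart. It assumes, based on my reading of the published dta 117+ layout, that such files open with an ASCII `<stata_dta><header><release>NNN</release>` marker; the function and constant names are hypothetical.

```python
import re
from typing import Optional

# Release numbers as listed above; everything else is treated as unknown.
RELEASE_TO_STATA = {117: "Stata 13", 118: "Stata 14", 119: "Stata 15"}

# Assumption: "new" dta files begin with this ASCII marker.
_HEADER = re.compile(rb"<stata_dta><header><release>(\d{3})</release>")


def detect_release(first_bytes: bytes) -> Optional[int]:
    """Return the dta release number, or None if the marker is absent."""
    match = _HEADER.match(first_bytes)
    return int(match.group(1)) if match else None


def describe(first_bytes: bytes) -> str:
    """Map the leading bytes of a file to a human-readable label."""
    release = detect_release(first_bytes)
    return RELEASE_TO_STATA.get(release, "unsupported or older format")
```

A single plugin could branch on the detected release instead of having a separate plugin per version, which is the approach suggested here.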

A very large portion of the v.14 format specification document appears to be identical to the v.13 spec. I'm seeing some minor differences (for example, in the later version the number of observations is encoded as an 8-byte integer; in v.13 it was a 4-byte integer). It'll take more careful work to identify all such differences, but it seems manageable.
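
As an illustration of how small such a difference is in code, the sketch below decodes the observation count with a width chosen by release. It is a hedged Python example, not the plugin's code: the function name is hypothetical, and the `LSF`/`MSF` byte-order tags are an assumption from my reading of the format documentation.

```python
import struct


def read_observation_count(release: int, byteorder: str, payload: bytes) -> int:
    """Decode the raw bytes of the observation-count field.

    release   -- dta release number (117 for v.13, 118 or later otherwise)
    byteorder -- "LSF" (little-endian) or "MSF" (big-endian); assumed tags
    payload   -- exactly the bytes of the field, 4 or 8 depending on release
    """
    endian = "<" if byteorder == "LSF" else ">"
    # v.13 stores a 4-byte unsigned integer; later versions use 8 bytes.
    fmt = "I" if release == 117 else "Q"
    (count,) = struct.unpack(endian + fmt, payload)
    return count
```

`struct.unpack` raises `struct.error` if the payload length does not match the expected width, so a v.14 file fed through the v.13 path fails loudly rather than silently misreading the count.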

The v.15 format is explicitly advertised as exactly the same as v.14, with a single exception: the later format allows for more than 32K (32,767) variables.
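
One way to keep this kind of per-version difference manageable is a small layout table keyed by release. The Python sketch below is illustrative only: the class and field names are hypothetical, the table deliberately covers just the two header counts discussed here, and the widths (2-byte variable count through dta 118, 4-byte in dta 119) are my understanding of the spec and should be checked against it.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class HeaderLayout:
    nvar_bytes: int  # width of the variable-count field
    nobs_bytes: int  # width of the observation-count field


# Assumed widths; not a complete description of any release.
LAYOUTS = {
    117: HeaderLayout(nvar_bytes=2, nobs_bytes=4),
    118: HeaderLayout(nvar_bytes=2, nobs_bytes=8),
    119: HeaderLayout(nvar_bytes=4, nobs_bytes=8),
}


def changed_fields(old: int, new: int) -> list:
    """List the layout fields whose widths differ between two releases."""
    return [
        f.name
        for f in fields(HeaderLayout)
        if getattr(LAYOUTS[old], f.name) != getattr(LAYOUTS[new], f.name)
    ]
```

With a table like this, supporting a new release mostly means adding one row and handling the fields that `changed_fields` reports, rather than forking the reader.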

The conclusion: it appears to be possible to add support for v.14 and v.15 by extending and improving the existing code. We should definitely add support for both at the same time (since v.15 is a minor extension of v.14). Whether this should be handled separately from improving/debugging the v.13 support that we already provide can be discussed (maybe it should).

All 3 comments

Once again, we DO support Stata 13 already. Rather than "may work in certain cases," it's the other way around: there may be instances of Stata 13 files that we can't ingest, whether because there's a bug in our implementation or because something was missing from the documentation that was used to implement the ingest plugin, so that we come across some data encoding that we don't know how to handle. We can handle such cases as bug fixes/improvements to the Stata 13 ingest plugin. (We have a record in the prod. database of every Stata 13 file that failed to ingest, as well as the files themselves, so it should be easy to diagnose and fix any such issues.)
I used to say that we should always be careful to create realistic expectations by communicating to users that we cannot promise we'll ever be able to ingest 100% of files in a given format; that is the nature of working with somebody else's proprietary formats. But we can definitely work to make the success rate higher.
With Stata in particular, however, since they have been doing an exceptionally good job of documenting their formats, it appears to be possible to achieve a success rate that is close to 100%.

Updated #2301 and #3339, closing.
