Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 6834231
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 26, 20262026-05-26T23:05:10+00:00 2026-05-26T23:05:10+00:00

I’m working on a project right now where I have been slowly accumulating a

  • 0

I’m working on a project right now where I have been slowly accumulating a bunch of different variables from a bunch of different sources. Being a somewhat clever person, I created a different sub-directory for each under a main “original_data” directory, and included a .txt file with the URL and other descriptors of where I got the data from. Being an insufficiently clever person, these .txt files have no structure.

Now I am faced with the task of compiling a methods section which documents all the different data sources. I am willing to go through and add structure to the data, but then I would need to find or build a reporting tool to scan through the directories and extract the information.

This seems like something that ProjectTemplate would have already, but I can’t seem to find that functionality there.

Does such a tool exist?

If it does not, what considerations should be taken into account to provide maximum flexibility? Some preliminary thoughts:

  1. A markup language should be used (YAML?)
  2. All sub-directories should be scanned
  3. To facilitate (2), a standard extension for a dataset descriptor should be used
  4. Critically, to make this most useful there needs to be some way to match variable descriptors with the name that they ultimately take on. Therefore either all renaming of variables has to be done in the source files rather than in a cleaning step (less than ideal), some code-parsing has to be done by the documentation engine to track variable name changes (ugh!), or some simpler hybrid such as allowing the variable renames to be specified in the markup file should be used.
  5. Ideally the report would be templated as well (e.g. “We pulled the [var] variable from [dset] dataset on [date].”), and possibly linked to Sweave.
  6. The tool should be flexible enough to not be overly burdensome. This means that minimal documentation would simply be a dataset name.
  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-26T23:05:10+00:00Added an answer on May 26, 2026 at 11:05 pm

    This is a very good question: people should be very concerned about all of the sequences of data collection, aggregation, transformation, etc., that form the basis for statistical results. Unfortunately, this is not widely practiced.

    Before addressing your questions, I want to emphasize that this appears quite related to the general aim of managing data provenance. I might as well give you a Google link to read more. 🙂 There are a bunch of resources that you’ll find, such as the surveys, software tools (e.g. some listed in the Wikipedia entry), various research projects (e.g. the Provenance Challenge), and more.

    That’s a conceptual start, now to address practical issues:

    I’m working on a project right now where I have been slowly accumulating a bunch of different variables from a bunch of different sources. Being a somewhat clever person, I created a different sub-directory for each under a main “original_data” directory, and included a .txt file with the URL and other descriptors of where I got the data from. Being an insufficiently clever person, these .txt files have no structure.

    Welcome to everyone’s nightmare. 🙂

    Now I am faced with the task of compiling a methods section which documents all the different data sources. I am willing to go through and add structure to the data, but then I would need to find or build a reporting tool to scan through the directories and extract the information.

    No problem. list.files(...,recursive = TRUE) might become a good friend; see also listDirectory() in R.utils.

    It’s worth noting that filling in a methods section on data sources is a narrow application within data provenance. In fact, it’s rather unfortunate that the CRAN Task View on Reproducible Research focuses only on documentation. The aims of data provenance are, in my experience, a subset of reproducible research, and documentation of data manipulation and results are a subset of data provenance. Thus, this task view is still in its infancy regarding reproducible research. It might be useful for your aims, but you’ll eventually outgrow it. 🙂

    Does such a tool exist?

    Yes. What are such tools? Mon dieu… it is very application-centric in general. Within R, I think that these tools are not given much attention (* see below). That’s rather unfortunate – either I’m missing something, or else the R community is missing something that we should be using.

    For the basic process that you’ve described, I typically use JSON (see this answer and this answer for comments on what I’m up to). For much of my work, I represent this as a “data flow model” (that term can be ambiguous, by the way, especially in the context of computing, but I mean it from a statistical analyses perspective). In many cases, this flow is described via JSON, so it is not hard to extract the sequence from JSON to address how particular results arose.

    For more complex or regulated projects, JSON is not enough, and I use databases to define how data was collected, transformed, etc. For regulated projects, the database may have lots of authentication, logging, and more in it, to ensure that data provenance is well documented. I suspect that that kind of DB is well beyond your interest, so let’s move on…

    1. A markup language should be used (YAML?)

    Frankly, whatever you need to describe your data flow will be adequate. Most of the time, I find it adequate to have good JSON, good data directory layouts, and good sequencing of scripts.

    2. All sub-directories should be scanned

    Done: listDirectory()

    3. To facilitate (2), a standard extension for a dataset descriptor should be used

    Trivial: “.json”. 😉 Or “.SecretSauce” works, too.

    4. Critically, to make this most useful there needs to be some way to match variable descriptors with the name that they ultimately take on. Therefore either all renaming of variables has to be done in the source files rather than in a cleaning step (less than ideal), some code-parsing has to be done by the documentation engine to track variable name changes (ugh!), or some simpler hybrid such as allowing the variable renames to be specified in the markup file should be used.

    As stated, this doesn’t quite make sense. Suppose that I take var1 and var2, and create var3 and var4. Perhaps var4 is just a mapping of var2 to its quantiles and var3 is the observation-wise maximum of var1 and var2; or I might create var4 from var2 by truncating extreme values. If I do so, do I retain the name of var2? On the other hand, if you’re referring to simply matching “long names” with “simple names” (i.e. text descriptors to R variables), then this is something only you can do. If you have very structured data, it’s not hard to create a list of text names matching variable names; alternatively, you could create tokens upon which string substitution could be performed. I don’t think it’s hard to create a CSV (or, better yet, JSON ;-)) that matches variable name to descriptor. Simply keep checking that all variables have matching descriptor strings, and stop once that’s done.

    5. Ideally the report would be templated as well (e.g. “We pulled the [var] variable from [dset] dataset on [date].”), and possibly linked to Sweave.

    That’s where others’ suggestions of roxygen and roxygen2 can apply.

    6. The tool should be flexible enough to not be overly burdensome. This means that minimal documentation would simply be a dataset name.

    Hmm, I’m stumped here. 🙂

    (*) By the way, if you want one FOSS project that relates to this, check out Taverna. It has been integrated with R as documented in several places. This may be overkill for your needs at this time, but it’s worth investigating as an example of a decently mature workflow system.


    Note 1: Because I frequently use bigmemory for large data sets, I have to name the columns of each matrix. These are stored in a descriptor file for each binary file. That process encourages the creation of descriptors matching variable names (and matrices) to descriptors. If you store your data in a database or other external files supporting random access and multiple R/W access (e.g. memory mapped files, HDF5 files, anything but .rdat files), you will likely find that adding descriptors becomes second nature.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I have a jquery bug and I've been looking for hours now, I can't
this is what i have right now Drawing an RSS feed into the php,
I have a bunch of posts stored in text files formatted in yaml/textile (from
I have a string like this: La Torre Eiffel paragonata all’Everest What PHP function
I have a text area in my form which accepts all possible characters from
I am trying to loop through a bunch of documents I have to put
link Im having trouble converting the html entites into html characters, (&# 8217;) i
I have just tried to save a simple *.rtf file with some websites and
For some reason, after submitting a string like this Jack’s Spindle from a text
I have this code to decode numeric html entities to the UTF8 equivalent character.

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.