Digifi ETL Primer

ETL

Digifi’s ETL feature automatically cleans, describes and tracks your data so you can spend less time manipulating it and more time analyzing it. The result is a significant improvement in productivity and accuracy.

Requirements

The only global requirement is that you use the comma-separated values (CSV) file format. The required file naming convention is your_file_name_YYYYMMDD.csv, where the file name uses all lowercase characters and YYYYMMDD denotes the 4-digit year, 2-digit month, and 2-digit day of your extract file or data. As an example, a CRM `User` file extract for November 20, 2015 would be named `crm_user_20151120.csv`.
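If your source system does not stamp extracts with a date, you can add one when you export the file. A minimal sketch (the raw file name `crm_user.csv` is illustrative):

  # Append today's date in YYYYMMDD form to match the Digifi convention
  mv crm_user.csv crm_user_$(date +%Y%m%d).csv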

With files in this format, the ETL module is smart enough to snapshot your data, catch variations in Excel, datetime, and numeric formats, and conform elements to an analytic standard. This saves you from manually editing values and provides peace of mind that your data is clean for analysis.

Usage

You can load data via two methods: autoload or batch.

Autoload:
This is the easiest method and the most useful for loading files for a one-time analysis. In RStudio, follow these steps:

  1. Navigate to the autoload directory (~/digifi/etl/autoload)
  2. Click `Upload` in the lower right-hand menu bar
  3. Click `Browse` and select your file
  4. Click `OK`

The autoload procedure will begin immediately after your file is uploaded. A file named `queue` will appear and report the status of the current file being processed. You can place as many files in the autoload directory as you like. They will process sequentially. The `queue` file will disappear when processing is done. More information about autoload is located here.
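If you prefer the terminal, you can achieve the same result by placing a file directly in the autoload directory and watching the `queue` file. A minimal sketch (the file name `crm_user_20151120.csv` is illustrative):

  # Drop a file into the autoload directory to trigger processing
  cp crm_user_20151120.csv ~/digifi/etl/autoload/

  # Check the status of the file currently being processed
  cat ~/digifi/etl/autoload/queue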

Batch:
This is the preferred method for scheduled data loads. As `bighat`, follow these steps on the command line:

  1. Go to the properties directory:
    cd ~/digifi/etl/props
  2. Create a properties file:
    extractprops [module name] digifi
  3. Go to the ETL directory:
    cd ~/digifi/etl
  4. Add to the `config` file:
    MODULES="[module name]"
  5. Run ETL:
    extractcall [run date (YYYYMMDD)]

For batch processing to work, you *MUST* follow the Digifi file naming convention (your_file_name_YYYYMMDD.csv), and the date passed to `extractcall` *MUST* match the YYYYMMDD date in the file name. More information about batch loading is located here.
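Putting the steps together, a batch run for the `crm_user` example above might look like the following. This is an illustrative sketch: the module name `CRM_USER` assumes the uppercase form of the raw file name (see Naming Conventions below), and the run date matches the example extract.

  # 1. Create a properties file for the module
  cd ~/digifi/etl/props
  extractprops CRM_USER digifi

  # 2. Register the module in the ETL config file
  #    (add or edit the MODULES line; shown here as an append for a fresh config)
  cd ~/digifi/etl
  echo 'MODULES="CRM_USER"' >> config

  # 3. Run ETL for the extract dated November 20, 2015
  extractcall 20151120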

Naming Conventions

The name you choose for your raw data file and props file directly determines the names of the ETL files and database objects the system creates.

Files you name:

  • Raw File Name:
    your_file_name_YYYYMMDD.csv
  • Props File (batch only):
    YOUR_FILE_NAME

Files or objects the system will name:

  • Extract File Name:
    YOUR_FILE_NAME_YYYYMMDD.extract
  • Staging File Name:
    YOUR_FILE_NAME_YYYYMMDD.staging
  • DB Staging Table:
    stg_your_file_name
  • DB Load Table:
    obj_your_file_name

Please note that raw file names and database table names are lowercase, while module names (the props, extract, and staging files) are uppercase. Because the platform derives all of these names from your file name, we strongly recommend establishing and sticking to a consistent naming convention for your files.
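As a concrete illustration, the `crm_user` example from earlier would produce names like the following (the module name and date are carried over from that example):

  Raw file:          crm_user_20151120.csv
  Props file:        CRM_USER
  Extract file:      CRM_USER_20151120.extract
  Staging file:      CRM_USER_20151120.staging
  DB staging table:  stg_crm_user
  DB load table:     obj_crm_user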

Customization

The props file (~/digifi/etl/props/YOUR_FILE_NAME) contains file-handling and database settings. You can override the default configuration by manually editing a given props file.
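For example, to review or adjust the settings for the `CRM_USER` module from the earlier examples (the module name is illustrative, and the individual setting names depend on your installation):

  # Inspect the current settings for a module
  cat ~/digifi/etl/props/CRM_USER

  # Edit the file to override defaults, then rerun extractcall
  nano ~/digifi/etl/props/CRM_USER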