Digifi ETL Batch Primer

Batch

Digifi’s batch feature equips an analyst with scheduled data services for recurring data management and scoring tasks. Like the autoload feature, batch processing provides self-service data prep to accelerate the most time-consuming (and sometimes most tedious) aspects of analysis. The result is a significant improvement in productivity, accuracy, and reliability, especially when supporting deployments.

Requirements

Batch processing only works with the comma-separated values (CSV) file format. The required file naming convention is [your_file_name]_YYYYMMDD.csv, where YYYY denotes the 4-digit year, MM the 2-digit month, and DD the 2-digit day.
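
For example, a raw export for the sfdc_user data set used in the walkthrough below, extracted on December 2, 2015, could be renamed to match the convention:

# hypothetical raw export renamed to the [your_file_name]_YYYYMMDD.csv form
mv sfdc_user.csv sfdc_user_20151202.csv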

Snapshot Dates

Every file and row loaded into the database is tagged with a `data_date` value representing the file extract date of the data. The YYYYMMDD portion of your file name supplies this date.

How it Works

Batch mode reads in an array of modules for processing. A series of data management routines is run for each module according to the configuration settings in its props file. Unlike autoload, batch processing can be scheduled, which is usually preferred when processing daily or weekly files or when scoring data at regular intervals.

The command for running a batch process is:

extractcall [run date]

where [run date] is a YYYYMMDD-formatted date. The `extractcall` script loads the module list specified in the ETL config file. The naming convention for the space-separated module list is:

"MODULE_1:YYYYMMDD MODULE_2:YYYYMMDD ... MODULE_N:YYYYMMDD"

where YYYYMMDD is an optional parameter denoting the file extract date of the data. If the YYYYMMDD parameter is absent for a module, batch processing snapshots that module's data with the run date passed to `extractcall`.

For example, using the command

extractcall 20151005

with a module list defined as

"STOCK FUNDAMENTAL SPLIT:20150930"

would process the daily stock and fundamental files for 10/5/15 and the end-of-quarter split file for 9/30/15.
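
In the ETL config file, that module list might be stored along the following lines (a sketch; the `MODULES` entry is edited during the walkthrough later in this primer, and the exact syntax in your installation may differ):

# modules without dates come first; SPLIT is pinned to its 9/30/15 extract
MODULES="STOCK FUNDAMENTAL SPLIT:20150930"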

Props File

A props file contains module-specific settings for database object names, raw data field names, and the type of database load process to use. Create a props file using the command:

extractprops "[module 1] [module 2] ... [module n]" [database name]

where the module name denotes the uppercase representation of your file name and the database name is the database you want to load to. The default configuration for a module is as follows:

Staging Database: digifi
Staging Table: stg_[your_file_name]
Load Database: digifi
Load Table: obj_[your_file_name]
Row Containing Field Names: First row
Type of Field Names: Text
Load Type: Delete and insert

After creating a new props file, use a text editor to customize the settings.
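
For example, the following would create props files for the three modules used earlier, targeting the digifi database. The generated STOCK props file would then hold the defaults listed above (presumably a stg_stock staging table, an obj_stock load table, and a delete-and-insert load type), ready to be edited before the first run:

# creates one props file per module named in the quoted, space-separated list
extractprops "STOCK FUNDAMENTAL SPLIT" digifi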

Load Type

`Load Type` is a parameter setting in the props file. It has three options:

drop: drop the current database object and recreate it according to the current file layout
truncate: truncate the current database object prior to loading the current file
insert: delete rows where `data_date` equals the run date, then insert the current file

Digifi ETL is smart enough to add new columns to the database object or to ignore columns that are missing from your file. Accordingly, `insert` is usually sufficient. However, if your field names are constantly changing, or a recurring analysis does not require historical snapshotting, then `drop` is the preferred option.

Setting Up a Batch Process

[Screenshot: aws_batch_ss_00]
  1. Connect via RStudio
  2. Navigate to: digifi > etl > data
  3. Click `Upload`
  4. Click `Browse` to locate the file
  5. Click `OK`
  6. (For this example, we will use the existing `random_values` file and an `sfdc_user` file uploaded with today’s date.)


[Screenshot: aws_batch_ss_02]

  1. Switch over to an SSH terminal
  2. Log in as `bighat` with command:
    su bighat
  3. Go to the props directory with command:
    cd ~/digifi/etl/props/
  4. Create a props file with command:
    extractprops "SFDC_USER" digifi
  5. Note how the `SFDC_USER` props file is now present (it can also be listed from the terminal, as shown after these steps)
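
To confirm from the terminal, list the props directory; the newly created file should appear alongside any existing ones:

ls ~/digifi/etl/props/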


[Screenshot: aws_batch_ss_03]

  1. Open `SFDC_USER` with a text editor
  2. Change the load table to `sfdc_user`
  3. Change the load type to `drop`
  4. Save and close


[Screenshot: aws_batch_ss_04]

  1. Go to the etl directory with command:
    cd ~/digifi/etl
  2. Open `config` with a text editor


[Screenshot: aws_batch_ss_05]

  1. Add modules to the `MODULES` array (see the sketch after these steps)
  2. Make sure the list is separated by spaces
  3. Make sure modules without dates are listed first
  4. Save and close
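
For this walkthrough, the entry might read as follows, assuming the module list is stored as a quoted, space-separated string and that the existing file corresponds to a RANDOM_VALUES module; since neither module carries a date, both are snapshotted with the run date passed to `extractcall`:

MODULES="RANDOM_VALUES SFDC_USER"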


[Screenshot: aws_batch_ss_06]

  1. Run the batch process with command:
    extractcall 20151202
  2. View the daily process log with command:
    more ~/digifi/etl/log/DIGIFI_20151202.txt


[Screenshot: aws_batch_ss_07]

  1. Schedule the batch process with command:
    crontab -e
  2. Copy the following lines and paste into the editor (a note on the hardcoded run date follows these steps):
    PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/bighat
    ETLDIR=/home/bighat/digifi/etl

    30 18 * * * extractcall 20151202

  3. Save and close
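
Note that the entry above hardcodes the run date 20151202, so cron would reprocess that same snapshot every evening. For a genuinely recurring schedule, a variant along these lines passes the current date instead (a sketch; percent signs must be escaped in crontab, and you may prefer the previous day's date if files arrive overnight):

30 18 * * * extractcall $(date +\%Y\%m\%d)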


[Screenshot: aws_batch_ss_08]

  1. View the daily process log again with command:
    more ~/digifi/etl/log/DIGIFI_20151202.txt
  2. Note how processing started on schedule
  3. Learn more about crontab via `man crontab` and `man 5 crontab`