Digifi ETL Autoload Primer

Autoload

Digifi’s autoload feature gives analysts self-service data preparation, accelerating the most time-consuming (and often most tedious) aspects of analysis. The result is a significant improvement in both productivity and accuracy.

Requirements

Autoload only works with the comma-separated values (CSV) file format. The required file naming convention is [your_file_name].csv.

Snapshot Dates

Every file and row loaded into the database is tagged with a `data_date` value representing the extract date of the data. Autoload does not require a date in your file name; if none is present, autoload stamps the data in the database with the date at run time. However, if you would like to tag your data with a file or snapshot date, simply include the date in your file name per the naming convention [your_file_name_YYYYMMDD].csv, where YYYY denotes the 4-digit year, MM the 2-digit month, and DD the 2-digit day.
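
For example, you can append today's date to a file name from R before uploading. A minimal sketch; the base name `sales.csv` is illustrative only:

    # Rename sales.csv to sales_YYYYMMDD.csv using today's date
    stamp <- format(Sys.Date(), "%Y%m%d")
    file.rename("sales.csv", paste0("sales_", stamp, ".csv"))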

How it Works

Autoload is a series of data management routines that trigger when a file is queued for processing. Unlike batch processing, where configuration files and strict naming conventions are enforced, autoload conforms file names to a Digifi standard, initiates the ETL engine, and logs progress to its queue without any prior configuration. Simply move a CSV file into the autoload directory and wait for processing to complete.
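
Because no prior configuration is needed, you can also trigger a load from the R console by copying a file into the autoload directory. A minimal sketch, assuming the default directory listed below and an illustrative file name:

    # Queue a CSV for autoload by copying it into the autoload directory
    file.copy("sales_20151125.csv", "~/digifi/etl/autoload")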

Default Settings

The default configuration for an autoload process is as follows:

Autoload Directory: ~/digifi/etl/autoload
Data Directory: ~/digifi/etl/data
Processing Time: Approximately 45 seconds per MB (varies with AWS instance type)
Processing Scope: Atomic (every element is scanned)
Database Permissions: digifi (full privileges); _meta (limited privileges)
Meta Database: _meta
Meta Table: data_type
Staging Database: digifi
Staging Table: stg_[your_file_name]
Load Database: digifi
Load Table: obj_[your_file_name]
Row Containing Field Names: First row
Type of Field Names: Text
Load Type: Delete and insert
Autoload Log: ~/digifi/etl/log/QUEUE_YYYYMMDDHHMMSS.log
ETL Log: ~/digifi/etl/log/[YOUR_FILE_NAME]_YYYYMMDD.log
Daily Process Log: ~/digifi/etl/log/DIGIFI_YYYYMMDD.txt
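
Given these defaults, you can inspect the most recent queue log directly from R. A minimal sketch, assuming at least one autoload run has produced a QUEUE_*.log file:

    # Find the newest autoload queue log and print its contents
    logs <- list.files("~/digifi/etl/log", pattern = "^QUEUE_.*\\.log$", full.names = TRUE)
    latest <- logs[which.max(file.mtime(logs))]
    cat(readLines(latest), sep = "\n")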
 
Autoloading Your First File
[Screenshot: aws_autoload_ss_01]
  1. Connect via RStudio
  2. Navigate to: digifi > etl > autoload
  3. Click `Upload`
  4. Click `Browse` to locate your file
  5. Click `OK`


[Screenshot: aws_autoload_ss_02]

  1. Autoload will start processing once your file upload has finished
  2. Your file is then moved to the data directory during processing
  3. The `queue` file persists while processing is underway and disappears once processing has finished (a minimal wait loop is sketched below)
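
If a script needs to block until autoload finishes, it can poll for the `queue` file. A minimal sketch, assuming the `queue` file appears in the default autoload directory:

    # Check every 5 seconds; exit the loop once the queue file disappears
    while (file.exists("~/digifi/etl/autoload/queue")) {
      Sys.sleep(5)
    }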


[Screenshot: aws_autoload_ss_03]

  1. Click on `queue` to view processing details
  2. Note how the file `sfdc_user.csv` was renamed to `sfdc_user_20151125.csv` (you can verify this from R, as sketched below)
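
You can verify the rename by listing the data directory, where processed files land. A minimal sketch using the example file above:

    # The renamed, date-stamped file should appear in the data directory
    list.files("~/digifi/etl/data", pattern = "^sfdc_user")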


[Screenshot: aws_autoload_ss_04]

  1. For your information, you can kill a running autoload process at any time
  2. To do so, use the command:
    system("extractstop")


[Screenshot: aws_autoload_ss_05]

  1. Wait for the `queue` file to disappear
  2. Select File > New File > R Script
  3. Copy and paste the following syntax into the window:
    #!/usr/bin/env Rscript

    # dbGetQuery() is provided by the DBI package
    library(DBI)

    # Connect to the database via Digifi's connection helper
    dbh <- getDbh()

    # Fetch query results into a data frame
    sql <- "DESCRIBE obj_[your_file_name]"
    df <- suppressWarnings(dbGetQuery(dbh, sql))

    View(df)



[Screenshot: aws_autoload_ss_06]

  1. Replace [your_file_name] with the actual file name
  2. Click the `Source` drop-down menu
  3. Select `Source with Echo`
  4. Review how your file was loaded into the database
  5. Change your SQL statement to fetch the data you want for your analysis (see the sketch below)
  6. Copy and paste R syntax from a Digifi example to get started with your analysis
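
For example, once the DESCRIBE output looks right, you might swap in a SELECT to pull rows for analysis. A minimal sketch; the LIMIT is illustrative only:

    # Fetch a sample of the loaded rows instead of the table schema
    sql <- "SELECT * FROM obj_[your_file_name] LIMIT 100"
    df <- suppressWarnings(dbGetQuery(dbh, sql))
    View(df)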