Model

A model is a single JSON file that describes what type of data should be generated and where it should be written, and that lets you set some specific options.

Structure of a model

A model file is composed of four sections, each of which is an array:

{
  "Fields": [

  ],
  "Table_Names": [

  ],
  "Primary_Keys": [

  ],
  "Options": [

  ]
}

  • Fields lists all the fields (columns) you want to generate, with their types and parameters
  • Table_Names is an array of key/value pairs that define where data should be generated
  • Primary_Keys is an array of key/value pairs that define which primary keys are used for Kafka, Kudu and HBase
  • Options is an array of key/value pairs that define specific properties (such as replication factor, buffer, etc.)
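
As a quick sanity check of this layout, the following Python sketch verifies that a model contains the four sections and that each one is an array. The function name check_sections is hypothetical and not part of Datagen, and whether Datagen itself tolerates a missing section is not covered here.

```python
import json

# Section names taken from the model structure described above.
REQUIRED_SECTIONS = ("Fields", "Table_Names", "Primary_Keys", "Options")


def check_sections(model_text):
    """Return a list of problems found in the top-level layout of a model."""
    model = json.loads(model_text)
    problems = []
    for section in REQUIRED_SECTIONS:
        if section not in model:
            problems.append(f"missing section: {section}")
        elif not isinstance(model[section], list):
            problems.append(f"section is not an array: {section}")
    return problems


empty_model = '{"Fields": [], "Table_Names": [], "Primary_Keys": [], "Options": []}'
print(check_sections(empty_model))  # an empty list means the layout is valid
```

Running a check like this before submitting a model catches layout mistakes early, without starting a generation.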

Let’s dive into each section below.

Fields

Fields is a list of Field objects.

A field is an object consisting of at least two required parameters:

  • name: name of the field
  • type: type of the field

It can also take several optional parameters, depending on its type:

  • min
  • max
  • length
  • possible_values
  • possible_values_weighted
  • filters
  • conditionals
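
The two required parameters can be checked with a few lines of Python. This is only an illustrative sketch: missing_parameters is a hypothetical helper, not a Datagen function, and it deliberately ignores the optional parameters.

```python
# Required parameters, as listed above.
REQUIRED_PARAMETERS = ("name", "type")


def missing_parameters(field):
    """Return the required parameters that a field definition lacks."""
    return [key for key in REQUIRED_PARAMETERS if key not in field]


age = {"name": "age", "type": "INTEGER", "min": 18, "max": 99}
incomplete = {"name": "size"}

print(missing_parameters(age))         # []
print(missing_parameters(incomplete))  # ['type']
```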

Let’s explore the different types of fields and their possible parameters.

Types of Fields - Basic

Fields can be of many different types. Here are the basic, self-explanatory ones:

  • STRING an alphanumeric string (length is the length of the string, 20 by default)
  • STRINGAZ an alphabetic string with no digits (length is the length of the string, 20 by default)
  • INTEGER
  • INCREMENT_INTEGER an integer incremented for each row
  • INCREMENT_LONG a long incremented for each row
  • BOOLEAN
  • FLOAT
  • LONG
  • TIMESTAMP
  • BYTES a byte array (length is the length of the byte array, 20 by default)
  • HASHMD5 the hash of a random string (length is the size of the byte array, 32 by default)
  • BLOB a byte array of 1 MB by default (length is the length of the byte array); use it carefully

Some Examples:

{
    "name": "size",
    "type": "INTEGER"
}
{
    "name": "bool",
    "type": "BOOLEAN"
}
{
    "name": "startDate",
    "type": "TIMESTAMP"
}

Some Examples with min, max, length:

An integer between 18 and 99:

{
    "name": "age",
    "type": "INTEGER",
    "min": 18,
    "max": 99
}

A bytes array of 10 bytes:

{
    "name": "bytesLittleArray",
    "type": "BYTES",
    "length" : 10
}

Examples with possible_values:

A string picked from the values defined in possible_values:

{
    "name": "department",
    "type": "STRING",
    "possible_values": ["hr", "consulting", "marketing", "finance"]
}

A string picked from the values defined in possible_values_weighted, where each value has its own weight and the weights sum to 100. In this case, there will be 70% of BRONZE, 20% of SILVER, 8% of GOLD and 2% of PLATINUM:

{
  "name": "membership",
  "type": "STRING",
  "possible_values_weighted": {
    "BRONZE": 70,
    "SILVER": 20,
    "GOLD": 8,
    "PLATINUM": 2
  }
}
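
To make the effect of the weights concrete, here is a small Python sketch of weighted picking. It is not Datagen's implementation; pick_weighted is a hypothetical helper that relies on the standard library's random.choices.

```python
import random

# Weights copied from the membership example above; they sum to 100.
WEIGHTS = {"BRONZE": 70, "SILVER": 20, "GOLD": 8, "PLATINUM": 2}


def pick_weighted(weights, rng):
    """Pick one key with probability proportional to its weight."""
    values = list(weights)
    return rng.choices(values, weights=[weights[v] for v in values], k=1)[0]


rng = random.Random(42)  # fixed seed so the run is repeatable
sample = [pick_weighted(WEIGHTS, rng) for _ in range(10_000)]
for value in WEIGHTS:
    # Print the observed share of each value, in percent.
    print(value, round(100 * sample.count(value) / len(sample), 1))
```

Over many rows, the observed shares should approach the configured weights.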

Types of Fields - Advanced

These are more “advanced” types:

  • BIRTHDATE a date between 1910 and 2020 (but you can set your own limits)
  • NAME a first name taken from a dictionary of more than 20,000 names (can be filtered by country)
  • COUNTRY a country name taken from a dictionary
  • PHONE a phone number of 10 digits with the international prefix in front (can be filtered by country)
  • EMAIL a string of the form (.|)@(gaagle.com|yahaa.com|uutlook.com|email.fr)
  • IP a string representing an IPv4 address: 0-255.0-255.0-255.0-255
  • UUID a universally unique identifier: xxxx-xxxx-xxxx-xxxx
  • CITY an object representing an existing city (name, lat, long, country), taken from a dictionary of more than 10,000 cities; only the name is used as the value of this field (can be filtered by country)
  • CSV an object taken from a given CSV file
  • LINK a string whose value is derived from another field, currently a CITY or CSV field

Some basic examples:

{
  "name": "name",
  "type": "NAME",
  "filters": ["USA"]
}
{
  "name": "birthdate",
  "type": "BIRTHDATE",
  "min": "1/1/1955",
  "max": "1/1/1999"
}

Field City

City is a special field that loads a dictionary of 40K+ cities around the world, with their associated latitude, longitude and country.

It can be filtered by one or more countries.

The example below creates four fields:

  • City name (only in France and Spain)
  • Latitude of this city (available as lat)
  • Longitude of this city (available as long)
  • Country where this city is located (available as country)

{
  "name": "city",
  "type": "CITY",
  "filters": ["France", "Spain"]
},
{
  "name": "city_lat",
  "type": "LINK",
  "conditionals": {
    "link": "$city.lat"
  }
},
{
  "name": "city_long",
  "type": "LINK",
  "conditionals": {
    "link": "$city.long"
  }
},
{
  "name": "city_country",
  "type": "LINK",
  "conditionals": {
    "link": "$city.country"
  }
}
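
One way to picture how a LINK value is obtained is the Python sketch below. It assumes the CITY object is kept in memory as a dictionary with the keys name, lat, long and country; that representation and the helper resolve_link are hypothetical, not Datagen's actual code.

```python
def resolve_link(link, objects):
    """Resolve a link such as "$city.lat" against already generated objects."""
    field_name, attribute = link.lstrip("$").split(".", 1)
    return objects[field_name][attribute]


# Hypothetical in-memory form of one generated CITY value.
objects = {
    "city": {"name": "Paris", "lat": "48.8566", "long": "2.3522", "country": "France"}
}

row = {
    "city": objects["city"]["name"],  # the CITY field itself only yields the name
    "city_lat": resolve_link("$city.lat", objects),
    "city_long": resolve_link("$city.long", objects),
    "city_country": resolve_link("$city.country", objects),
}
print(row)
```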

Field CSV

CSV is a special field that reads the CSV file at the given path, loads it into memory and parses it.

Filters can be applied to it, and other fields can be derived from it.

For example, suppose we have this CSV in /opt/cloudera/parcels/DATAGEN/dictionaries/person_test.csv:

name;department;country
francois;PS;France
kamel;SE;France
thomas;RH;Germany
sebastian;PS;Spain

We can create two fields:

  • The name of the person (keeping only the rows whose country is France)
  • The department of this person

{
  "name": "person",
  "type": "CSV",
  "filters": ["country=France"],
  "file": "/opt/cloudera/parcels/DATAGEN/dictionaries/person_test.csv",
  "field": "name"
},
{
  "name": "person_department",
  "type": "LINK",
  "conditionals": {
    "link": "$person.department"
  }
}
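
The behaviour of this pair of fields can be approximated in Python as follows. The sketch inlines the sample CSV instead of reading it from disk, and it assumes that several filters, if given, must all match; that rule and the helper load_rows are assumptions, not documented Datagen behaviour.

```python
import csv
import io
import random

# Contents of person_test.csv from the example above, inlined for the sketch.
CSV_TEXT = """name;department;country
francois;PS;France
kamel;SE;France
thomas;RH;Germany
sebastian;PS;Spain
"""


def load_rows(text, filters):
    """Parse a semicolon-separated CSV and keep rows matching every "column=value" filter."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter=";"))
    for item in filters:
        column, expected = item.split("=", 1)
        rows = [row for row in rows if row[column] == expected]
    return rows


rows = load_rows(CSV_TEXT, ["country=France"])
picked = random.Random(0).choice(rows)  # one CSV line is picked per generated row

# "person" takes the column named by "field"; the LINK reads another column of the same line.
print({"person": picked["name"], "person_department": picked["department"]})
```

Because both values come from the same CSV line, the person and the department always stay consistent with each other.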

Conditionals - Formula

Conditionals is an object that lets you define fields that depend on other fields.

A formula is an expression to evaluate, in which field references such as $starting_hour are replaced with the fields’ values. For example:

{
  "name": "starting_hour",
  "type": "INTEGER",
  "min": 0,
  "max": 16
},
{
  "name": "finished_hour",
  "type": "INTEGER",
  "conditionals": {
    "formula": "$starting_hour + 8"
  }
}

Note: a formula is evaluated by a JavaScript evaluator (through a script engine manager in Java), so it can contain complex computations and even if/else statements.
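
The substitute-then-evaluate idea can be sketched in Python. This is a stand-in for illustration only: Datagen evaluates formulas as JavaScript, whereas this hypothetical evaluate_formula helper only handles expressions that are also valid Python arithmetic.

```python
import re


def evaluate_formula(formula, row):
    """Replace each $field reference with its value, then evaluate the arithmetic."""
    expression = re.sub(r"\$(\w+)", lambda m: str(row[m.group(1)]), formula)
    # eval with empty builtins is acceptable for a sketch over trusted input only.
    return eval(expression, {"__builtins__": {}}, {})


row = {"starting_hour": 9}
row["finished_hour"] = evaluate_formula("$starting_hour + 8", row)
print(row)  # {'starting_hour': 9, 'finished_hour': 17}
```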

Conditionals - Injection

Conditionals is an object that lets you define fields that depend on other fields.

An injection is a string in which each ${field_name} is replaced with the value of that field. For example:

{
  "name": "email",
  "type": "STRING",
  "conditionals": {
    "injection": "${name}@company.it"
  }
}
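
Python's string.Template happens to use the same ${field_name} placeholder syntax, so the substitution can be illustrated in a few lines. The inject helper is hypothetical and only mimics the behaviour described above.

```python
from string import Template


def inject(pattern, row):
    """Replace every ${field_name} placeholder with the field's value."""
    return Template(pattern).substitute(row)


row = {"name": "Gerhilt"}
row["email"] = inject("${name}@company.it", row)
print(row["email"])  # Gerhilt@company.it
```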

Conditionals - Conditions Line

Conditionals is an object that lets you define fields that depend on other fields.

Condition lines are evaluated one after the other; as soon as one is true, the field is set to the value on the right-hand side of that line.

Each condition line is made of checks. A check compares a field, referenced by its name prefixed with $ and substituted by its value, against a fixed value or another field (also substituted), using one of the operators <, >, = or !=. A line can combine several checks with the & (AND) or | (OR) operators.

An example:

{
  "name": "rain",
  "type": "STRING",
  "conditionals": {
    "$humidity_9_am>70 & $temperature_9_am<20 & $wind_force_9_am<80" : "true",
    "$humidity_9_pm>70 & $temperature_9_pm<20 & $wind_force_9_pm<80" : "true",
    "$wind_provenance_9_am=NORTH & $wind_force_9_am>80" : "true",
    "$wind_provenance_9_pm=NORTH & $wind_force_9_pm>80" : "true",
    "$humidity_9_pm>70 & $temperature_9_pm<25 & $pressure_9_pm<1010": "true",
    "$humidity_9_am>70 & $temperature_9_am<25 & $pressure_9_am<1010": "true",
    "default" : "false"
  }
}
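
The evaluation order can be sketched in Python as below. This is an approximation under stated assumptions, not Datagen's parser: it assumes that | separates alternatives which are each a chain of & checks, that < and > compare numbers, and that the default entry is used when no line is true. The helper names are hypothetical.

```python
import re

# One check: left operand, operator, right operand (operands may start with $).
CHECK = re.compile(r"^\s*(\$?[\w.-]+)\s*(!=|<|>|=)\s*(\$?[\w.-]+)\s*$")


def resolve(token, row):
    """Substitute a $field token with its value; other tokens are literals."""
    return row[token[1:]] if token.startswith("$") else token


def check(expression, row):
    """Evaluate one comparison such as "$humidity_9_am>70"."""
    left, operator, right = CHECK.match(expression).groups()
    left, right = resolve(left, row), resolve(right, row)
    if operator in ("<", ">"):
        left, right = float(left), float(right)
        return left < right if operator == "<" else left > right
    equal = str(left) == str(right)
    return equal if operator == "=" else not equal


def evaluate(conditionals, row):
    """Return the value of the first line that holds, else the "default" value."""
    for line, value in conditionals.items():
        if line == "default":
            continue
        # Assumption: "|" separates alternatives, each a chain of "&" checks.
        if any(all(check(c, row) for c in group.split("&")) for group in line.split("|")):
            return value
    return conditionals.get("default")


conditionals = {
    "$humidity_9_am>70 & $temperature_9_am<20 & $wind_force_9_am<80": "true",
    "$wind_provenance_9_am=NORTH & $wind_force_9_am>80": "true",
    "default": "false",
}
row = {"humidity_9_am": 85, "temperature_9_am": 15,
       "wind_force_9_am": 30, "wind_provenance_9_am": "SOUTH"}
print(evaluate(conditionals, row))  # true
```

Because lines are tried in order, the more specific conditions should come first and the default entry acts as the fallback.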

Ghosts

Any field can be a ghost field: it is computed but not included in the output.

This is useful to generate a field that aggregates, or depends on, other fields that should not appear in the output.

An example of how to create an address using ghost fields and concatenation:

{
  "name": "number",
  "type": "INTEGER",
  "min": 0,
  "max": 100,
  "ghost": "true"
},
{
  "name": "street",
  "type": "STRING",
  "possible_values": ["street", "avenue", "boulevard"],
  "ghost": "true"
},
{
  "name": "name",
  "type": "NAME",
  "ghost": "true"
},

{
  "name": "address",
  "type": "STRING",
  "conditionals": {
    "injection": "${number} ${street} ${name}"
  }
}

In this example, the fields number, street and name are generated but not printed; they only serve to build the address field.
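
The filtering step can be pictured with this Python sketch. The generated values and the helper visible_row are hypothetical; the sketch only shows that fields marked with "ghost": "true" are dropped from the final row.

```python
def visible_row(fields, generated):
    """Keep only the values whose field definition is not marked as a ghost."""
    ghosts = {f["name"] for f in fields if f.get("ghost") == "true"}
    return {name: value for name, value in generated.items() if name not in ghosts}


fields = [
    {"name": "number", "type": "INTEGER", "ghost": "true"},
    {"name": "street", "type": "STRING", "ghost": "true"},
    {"name": "name", "type": "NAME", "ghost": "true"},
    {"name": "address", "type": "STRING"},
]
# Hypothetical values for one generated row, before ghost fields are dropped.
generated = {"number": 12, "street": "avenue", "name": "Gerhilt",
             "address": "12 avenue Gerhilt"}

print(visible_row(fields, generated))  # {'address': '12 avenue Gerhilt'}
```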

Table_Names

These are all available keys to configure where data should be generated:

  • HDFS_FILE_PATH
  • HDFS_FILE_NAME
  • HBASE_TABLE_NAME
  • HBASE_NAMESPACE
  • KAFKA_TOPIC
  • OZONE_VOLUME
  • OZONE_BUCKET
  • OZONE_KEY_NAME
  • OZONE_LOCAL_FILE_PATH
  • SOLR_COLLECTION
  • HIVE_DATABASE
  • HIVE_HDFS_FILE_PATH
  • HIVE_TABLE_NAME
  • HIVE_TEMPORARY_TABLE_NAME
  • KUDU_TABLE_NAME
  • LOCAL_FILE_PATH
  • LOCAL_FILE_NAME
  • AVRO_NAME
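
The complete example later in this document writes each entry of Table_Names as its own single-key object. The Python sketch below builds a section in that shape from a flat mapping; the helper as_section and the destination values are hypothetical, and only the key names come from the list above.

```python
import json


def as_section(settings):
    """Turn a flat mapping into an array of single-key objects."""
    return [{key: value} for key, value in settings.items()]


# Hypothetical destination values; only the key names come from the list above.
table_names = as_section({
    "KAFKA_TOPIC": "employee_topic",
    "LOCAL_FILE_PATH": "/tmp/datagen/",
    "LOCAL_FILE_NAME": "employee",
})
print(json.dumps({"Table_Names": table_names}, indent=2))
```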

Primary_Keys

These are all the available keys to configure primary keys for some services:

  • KAFKA_MSG_KEY
  • HBASE_PRIMARY_KEY
  • KUDU_PRIMARY_KEYS
  • KUDU_HASH_KEYS
  • KUDU_RANGE_KEYS

Options

These are all the available keys to configure basic settings for some services:

  • HBASE_COLUMN_FAMILIES_MAPPING This mapping must be in the form: “CF:col1,col2;CF2:col5”
  • SOLR_SHARDS
  • SOLR_REPLICAS
  • KUDU_REPLICAS
  • ONE_FILE_PER_ITERATION
  • KAFKA_MESSAGE_TYPE
  • KAFKA_JAAS_FILE_PATH
  • SOLR_JAAS_FILE_PATH
  • HIVE_THREAD_NUMBER
  • HIVE_ON_HDFS
  • HIVE_TABLE_TYPE Can be External, Managed or Iceberg
  • HIVE_TABLE_FORMAT Can be Parquet, ORC, Avro or JSON
  • HIVE_TEZ_QUEUE_NAME
  • HIVE_TABLE_PARTITIONS_COLS This must be a comma-separated list of columns: “col1,col2”
  • HIVE_TABLE_BUCKETS_COLS This must be a comma-separated list of columns: “col1,col2”
  • HIVE_TABLE_BUCKETS_NUMBER
  • CSV_HEADER
  • DELETE_PREVIOUS
  • PARQUET_PAGE_SIZE
  • PARQUET_ROW_GROUP_SIZE
  • PARQUET_DICTIONARY_PAGE_SIZE
  • PARQUET_DICTIONARY_ENCODING
  • KAFKA_ACKS_CONFIG
  • KAFKA_RETRIES_CONFIG
  • KUDU_BUCKETS
  • KUDU_BUFFER
  • KUDU_FLUSH
  • OZONE_REPLICATION_FACTOR
  • HDFS_REPLICATION_FACTOR

Example: how to create a model

Let’s create a simple model to generate some data into Hive.

I would like to generate data that represents employees, with:

  • A name
  • Their location city
  • Their birthdate
  • Their phone number
  • Years of experience in the company
  • Their employee ID (in 6 digits)
  • Their department (among HR, CONSULTING, FINANCE, SALES, ENGINEERING, ADMINISTRATION, MARKETING)

The company is based in Germany, and so are all its employees.

Here is the resulting JSON:

{
    "Fields": [
      {
        "name": "name",
        "type": "NAME",
        "filters": ["Germany"]
      },
      {
        "name": "city",
        "type": "CITY",
        "filters": ["Germany"]
      },
      {
        "name": "phone_number",
        "type": "PHONE",
        "filters": ["Germany"]
      },
      {
        "name": "years_of_experience",
        "type": "INTEGER",
        "min": 0,
        "max": 10
      },
      {
        "name": "employee_id",
        "type": "INCREMENT_INTEGER",
        "min": 123456
      },
      {
        "name": "department",
        "type": "STRING",
        "possible_values": ["HR", "CONSULTING", "FINANCE", "SALES", "ENGINEERING", "ADMINISTRATION", "MARKETING"]
      }
    ],
    "Table_Names": [
        {"HIVE_HDFS_FILE_PATH": "/user/datagen/hive/employee_model/"},
        {"HIVE_DATABASE": "datagen_test"},
        {"HIVE_TABLE_NAME": "employee_model"},
        {"HIVE_TEMPORARY_TABLE_NAME": "employee_model_tmp"},
        {"AVRO_NAME": "datagenemployee"}
    ],
    "Primary_Keys": [
    ],
    "Options": [
    ]
}

Test a Model

A model can be tested through the API before launching a data generation.

Under model-tester-controller, the /model/test API takes as input either a path to a model or an uploaded model file, and returns one row generated with this model.

The output is:

{ "name" : "Gerhilt", "city" : "Beelen", "phone_number" : "+49 299776078", "years_of_experience" : "2", "employee_id" : "123457", "department" : "FINANCE" }

Launch Data Generation

Now we are ready. Using the swagger or making an API call directly (with curl, Postman or anything else), we launch a data generation like this:

Command in the swagger:

curl -X POST "https://ccycloud-1.lisbon.root.hwx.site:4242/datagen/hive" \
  -H "accept: */*" \
  -H "Content-Type: multipart/form-data" \
  -F "batches=10" \
  -F "model_file=@model-test.json;type=application/json" \
  -F "rows=10000" \
  -F "threads=10"

It returns the following UUID:

{ "commandUuid": "1567dfba-a8f9-4da9-b389-9bc30f4ec1d5" , "error": "" }
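
A client script would typically extract the UUID from this response. The Python sketch below does so, and assumes that a non-empty "error" value signals a failure; that interpretation and the helper command_uuid are assumptions, not documented API behaviour.

```python
import json

# Response body copied from the example above.
response_body = '{ "commandUuid": "1567dfba-a8f9-4da9-b389-9bc30f4ec1d5" , "error": "" }'


def command_uuid(body):
    """Return the command UUID, or raise if the response reports an error."""
    response = json.loads(body)
    if response["error"]:
        # Assumption: a non-empty "error" means the command was not accepted.
        raise RuntimeError(response["error"])
    return response["commandUuid"]


print(command_uuid(response_body))  # 1567dfba-a8f9-4da9-b389-9bc30f4ec1d5
```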

In Datagen Webserver logs, we can see at the end:

Let’s Verify

If you log into Hue (or Beeline) with sufficient privileges, you will see a new database, datagen_test, with a table employee_model and some data in it: