Data Generation - 2nd part

Using Cloudera Manager and pre-defined actions, data can be generated into all kind of services running in your platform.

Data has been generated in HDFS, Hive, Ozone, HBase.

Before going further and generate data into other services, it worths to take a closer look at what kind of data has been generated.

Introduction to Models

Heart of Datagen is the Model.

Each time that you want to generate data, Datagen will require a Model (or default to the default model).

A Model is a JSON file that defines what your data should look like.

As of now, you only used pre-defined models, but entire goal of Datagen is to let you provide your own model.

We will continue a little bit with pre-defined models, but we will get you through the process of creating model and all entire possibilities offered by the tool in the next section about models.

Pre-defined Models

As you may have guessed data generated in previous services is following some pre-defined models.

You can find these models on all machines where Datagen parcel has been deployed under directory: /opt/cloudera/parcels/DATAGEN/models/ .

Here is the list of all model files you can find in the parcel or in the source code under: src/main/resources/models/ .

ll -R /opt/cloudera/parcels/DATAGEN/models/
/opt/cloudera/parcels/DATAGEN/models/:
total 28
drwxr-xr-x 2 root root 4096 Oct 12 02:57 customer
-rw-r--r-- 1 root root 2111 Oct 12 02:57 example-model.json
drwxr-xr-x 2 root root 4096 Oct 13 00:47 finance
-rw-r--r-- 1 root root 5926 Oct 12 02:57 full-model.json
drwxr-xr-x 2 root root 4096 Oct 13 00:47 industry
drwxr-xr-x 2 root root 4096 Oct 13 00:47 public_service

/opt/cloudera/parcels/DATAGEN/models/customer:
total 36
-rw-r--r-- 1 root root 2144 Oct 12 02:57 customer-china-model.json
-rw-r--r-- 1 root root 2154 Oct 12 02:57 customer-france-model.json
-rw-r--r-- 1 root root 2155 Oct 12 02:57 customer-germany-model.json
-rw-r--r-- 1 root root 2150 Oct 12 02:57 customer-india-model.json
-rw-r--r-- 1 root root 2150 Oct 12 02:57 customer-italy-model.json
-rw-r--r-- 1 root root 2152 Oct 12 02:57 customer-japan-model.json
-rw-r--r-- 1 root root 2150 Oct 12 02:57 customer-spain-model.json
-rw-r--r-- 1 root root 2153 Oct 12 02:57 customer-turkey-model.json
-rw-r--r-- 1 root root 2147 Oct 12 02:57 customer-usa-model.json

/opt/cloudera/parcels/DATAGEN/models/finance:
total 4
-rw-r--r-- 1 root root 1748 Oct 12 02:57 transaction-model.json

/opt/cloudera/parcels/DATAGEN/models/industry:
total 12
-rw-r--r-- 1 root root 1712 Oct 12 02:57 plant-model.json
-rw-r--r-- 1 root root 1476 Oct 12 02:57 sensor-data-model.json
-rw-r--r-- 1 root root 1549 Oct 12 02:57 sensor-model.json

/opt/cloudera/parcels/DATAGEN/models/public_service:
total 16
-rw-r--r-- 1 root root 1899 Oct 12 02:57 incident-model.json
-rw-r--r-- 1 root root 2445 Oct 12 02:57 intervention-team-model.json
-rw-r--r-- 1 root root 3445 Oct 12 02:57 weather-model.json
-rw-r--r-- 1 root root 2289 Oct 12 02:57 weather-sensor-model.json

HDFS & Ozone

HDFS & Ozone buttons created 1 million customers from different countries (using the different customer models under /opt/cloudera/parcels/DATAGEN/models/customer/) and pushed them in Parquet file.

Sample of data in JSON format:

{ "name" : "Loris", "id" : "790001", "birthdate" : "1987-01-11", "city" : "Stevensville", "country" : "USA", "email" : "Loris@company.us", "phone_number" : "+1 7225688066", "membership" : "SILVER" }
{ "name" : "Marcell", "id" : "490001", "birthdate" : "1950-06-22", "city" : "Pontecorvo", "country" : "Italy", "email" : "Marcell@company.it", "phone_number" : "+39 995887416", "membership" : "BRONZE" }
{ "name" : "Ryong", "id" : "520001", "birthdate" : "1941-02-05", "city" : "Yachiyo", "country" : "Japan", "email" : "Ryong@company.jp", "phone_number" : "+81 809127101", "membership" : "PLATINUM" }

HBase

HBase button created a 1 million transactions (using the transaction model under /opt/cloudera/parcels/DATAGEN/models/finance/transaction-model.json).

Sample of data in JSON format:

{ "sender_id" : "50902", "receiver_id" : "10391", "amount" : "0.8084345", "execution_date" : "1665728236778", "currency" : "EUR" }
{ "sender_id" : "21403", "receiver_id" : "68104", "amount" : "0.65117764", "execution_date" : "1665728285129", "currency" : "USD" }

Hive

Hive button created a 1 million sensors data (using different models under /opt/cloudera/parcels/DATAGEN/models/industry/).

It will generate 100 plants data like this:

{ "plant_id" : "1", "city" : "Bollene", "lat" : "44,2803", "long" : "4,7489", "country" : "France" }

It will generate 100 000 sensors like this (each can be linked to a plant):

{ "sensor_id" : "1", "sensor_type" : "humidity", "plant_id" : "690" }

It will generate 1 000 000 sensors data like this (each can be linked to a sensor):

{ "sensor_id" : "58764", "timestamp_of_production" : "1665728724586", "value" : "-3000244563995128335" }

Local files

In Cloudera Manager:

Datagen > Actions > Generate Local data as CSV, JSON, AVRO, ORC, PARQUET

It launches a Cloudera Manager command making different API calls to Datagen Web server to generate multiple data using almost all possible models.

Output should be:

Let’s Verify

In a shell with a logged in user (optionally use datagen ones):

 cat /home/datagen/customer/customer-fr-0000000000.json
{ "name" : "Josse", "id" : "120001", "birthdate" : "2001-08-03", "city" : "Meylan", "country" : "France", "email" : "Josse@company.fr", "phone_number" : "+33 444585074", "membership" : "BRONZE" }
{ "name" : "Piet", "id" : "120002", "birthdate" : "1970-06-17", "city" : "Bures-sur-Yvette", "country" : "France", "email" : "Piet@company.fr", "phone_number" : "+33 851063627", "membership" : "BRONZE" }
{ "name" : "Armand", "id" : "120003", "birthdate" : "1990-10-04", "city" : "Notre-Dame-de-Gravenchon", "country" : "France", "email" : "Armand@company.fr", "phone_number" : "+33 575158362", "membership" : "BRONZE" }
{ "name" : "Marvin", "id" : "120004", "birthdate" : "1960-10-04", "city" : "Saint-Pryve-Saint-Mesmin", "country" : "France", "email" : "Marvin@company.fr", "phone_number" : "+33 588241506", "membership" : "BRONZE" }
{ "name" : "Vivian", "id" : "120005", "birthdate" : "1994-04-28", "city" : "La Cadiere-d'Azur", "country" : "France", "email" : "Vivian@company.fr", "phone_number" : "+33 553370858", "membership" : "BRONZE" }
{ "name" : "Jakob", "id" : "120006", "birthdate" : "1976-08-02", "city" : "Chaville", "country" : "France", "email" : "Jakob@company.fr", "phone_number" : "+33 208782811", "membership" : "BRONZE" }
{ "name" : "Bo", "id" : "120007", "birthdate" : "1966-10-14", "city" : "Brignoles", "country" : "France", "email" : "Bo@company.fr", "phone_number" : "+33 068739422", "membership" : "PLATINUM" }
{ "name" : "Emilienne", "id" : "120008", "birthdate" : "1976-02-23", "city" : "Orange", "country" : "France", "email" : "Emilienne@company.fr", "phone_number" : "+33 303877991", "membership" : "BRONZE" }
{ "name" : "Elise", "id" : "120009", "birthdate" : "1965-11-28", "city" : "Cosne sur Loire", "country" : "France", "email" : "Elise@company.fr", "phone_number" : "+33 540812701", "membership" : "SILVER" }
{ "name" : "Roelof", "id" : "120010", "birthdate" : "1982-06-01", "city" : "Magny-en-Vexin", "country" : "France", "email" : "Roelof@company.fr", "phone_number" : "+33 252194443", "membership" : "BRONZE" }

cat /home/datagen/finance/transaction/transaction-0000000000.csv

sender_id,receiver_id,amount,execution_date,currency
"11292","27627","0.7721951","1665729006111","USD"
"49294","95851","0.4893235","1665729006111","EUR"
"68670","8844","0.009439588","1665729006111","USD"
"61487","46071","0.22023022","1665729006111","EUR"
"14383","57358","0.07566887","1665729006111","YEN"
"89570","96238","0.35353237","1665729006111","USD"
"66066","69065","0.87496656","1665729006111","USD"
"43894","87454","0.11435127","1665729006111","USD"
"76777","19367","0.06878656","1665729006111","EUR"
"53649","14975","0.9570634","1665729006111","EUR"

ls -R /home/datagen/industry/
/home/datagen/industry/:
plant  sensor  sensor_data

/home/datagen/industry/plant:
plant-0000000000.avro

/home/datagen/industry/sensor:
sensor-0000000000.parquet

/home/datagen/industry/sensor_data:
sensor_data-0000000000.orc

SolR

In Cloudera Manager:

Datagen > Actions > Generate 1 Million Weather Data to SolR

It launches a Cloudera Manager command making different API calls to Datagen Web server to generate multiple data using almost all possible models.

Output should be:

It will generate 1 million weather data like this (using the weather model under /opt/cloudera/parcels/DATAGEN/models/public_service/weather-model.json)

{ "city" : "Seysses", "date" : "2021-03-25", "lat" : "43,4981", "long" : "1,3125", "wind_provenance_9_am" : "NORTH", "wind_force_9_am" : "3", "wind_provenance_9_pm" : "WEST", "wind_force_9_pm" : "12", "pressure_9_am" : "1004", "pressure_9_pm" : "1008", "humidity_9_am" : "46", "humidity_9_pm" : "52", "temperature_9_am" : "22", "temperature_9_pm" : "-8", "rain" : "false" }

Let’s Verify

Access SolR UI, (login as a user with enough rights):

Kudu

In Cloudera Manager:

Datagen > Actions > Generate 1 Million Public Service Data to Kudu

It launches a Cloudera Manager command making different API calls to Datagen Web server to generate multiple data using almost all possible models.

Output should be:

It will generate 1 million weather data like this (using the weather model under /opt/cloudera/parcels/DATAGEN/models/public_service/incident-model.json )

{ "city" : "Le Rove", "lat" : "43,3692", "long" : "5,2503", "reporting_timestamp" : "1665732947892", "emergency" : "URGENT", "type" : "WATER" }

Let’s Verify

Go to Hue or an Impala shell and make an INVALIDATE METADATA command to refresh the cache, then you will be able to see in database: datagen a new table publicservice_incident :

Kafka

Datagen > Actions > Generate 1 million weather data to Kafka in JSON OR Public Service Data to Kafka in Avro

It launches a Cloudera Manager command making different API calls to Datagen Web server to generate multiple data using almost all possible models.

Output should be:

It will generate 1 million weather data like this (using the weather model under /opt/cloudera/parcels/DATAGEN/models/public_service/weather-model.json)

{ "city" : "Seysses", "date" : "2021-03-25", "lat" : "43,4981", "long" : "1,3125", "wind_provenance_9_am" : "NORTH", "wind_force_9_am" : "3", "wind_provenance_9_pm" : "WEST", "wind_force_9_pm" : "12", "pressure_9_am" : "1004", "pressure_9_pm" : "1008", "humidity_9_am" : "46", "humidity_9_pm" : "52", "temperature_9_am" : "22", "temperature_9_pm" : "-8", "rain" : "false" }

Let’s Verify

You can make a kafka-console-consumer with enough rights and consume the topic from the beginning to verify production of messages.

But we will instead login to Streams Messaging Manager with a user’s with enough rights and see data:

If you picked th data generation with AVRO format, in Streams Messaging Manager:

If you picked th data generation with AVRO format, you can go to Schema Registry URL (login with a user’s with enough rights) and see the newly added schema:

Finally, if you have SQL Stream Builder installed in your cluster, make sure that user’s ssb & flink have access rights to generated topic, logged to the web console, upload your keytab if necessary and create the table on kafka topic (in JSON):

Then do a sample query to visualize data:

APIs

More than just these pre-defined buttons, Datagen is completely configurable and customizable.

All previous generation were indeed just a bunch opf API calls to Datagen web server.

A user should take advantage of APIs provided by Datagen to run data generation.

Now, we will go through a simple example, but if you want more information on APIs provided, see section on APIs.

First, go to Cloudera Manager > Datagen and click on Datagen Swagger UI :

It will open a new tab to the Swagger of the Datagen Web Server, this swagger will ask to authenticate you with the user/password you passed during the installation. If you did not provide one, it is by default admin as user and admin as password. (You can change this whenever you want in the configuration of Datagen).

Swagger should looks like this:

If you open the data-generation-controller, you should have a lot of endpoints: one per type of service where you want to generate data.

We will use an example the /datagen/hdfs-json endpoint.

Click on it.

Rows, Batches, Threads, Models

If you click on Try it out , you will bea able to fill in all possible parameters.

Do not be scared ! ALL PARAMETERS ARE OPTIONAL and have default values if you do not provide some

All APIs calls for data generation have at least 5 parameters in common:

rows = Number of rows to generate at each batch of data generation
batches = Number of batches to launch (you will end up to have (rows x batches) total rows generated)
threads = To speed up generation, you can multi thread this (by default it is single-threaded), it is recommended to go on 10 threads.
model by specifiying either:
- model_file = file path on the machine where a model is present (for example /opt/cloudera/parcels/DATAGEN/models/public_service/weather-model.json)
- model = upload directly to the swagger your model file from your computer

There are 3 parameters related to kerberos authentication:

kerb_auth = true or false depending on using kerberos or not
kerb_user = kerberos user to log in with to make data generation
kerb_keytab = path to the keytab of this user to login with (must be datagen user’s readable)

By default, all these are set to the datagen’s user.

There are also 2 other parameters that enables you to schedule a launch:

scheduled = true or false
delay_between_executions_seconds = The interval (in seconds) between two executions

All These parameters are discussed further in section on APIs.

Specific configs for HDFS

Each endpoint for a sink has other parameters that allows you to fully override the actual configuration for this service and this will only be in use for this data generation.

In the case of HDFS here, we have:

core_site_path = path to the core-site.xml
hdfs_site_path = path to the hdfs-site.xml
hdfs_uri = hdfs://mynamservice/

Example launch

In this example, we will generate data into HDFS in JSON format, using the swagger and specifiyig some parameters.

In the swagger UI, open hdfs-json endpoint and click on try it out, then do this:

Set batches to 10
Set rows to 1000
Set threads to 10
Set model to /opt/cloudera/parcels/DATAGEN/models/public_service/weather-model.json

You can click on Execute .

Swagger will show you just below what is the equivalent curl request:

curl -X POST "https://ccycloud-1.lisbon.root.hwx.site:4242/datagen/hdfs-json" -H  "accept: */*" -H  "Content-Type: multipart/form-data" -F "batches=10" -F "model=/opt/cloudera/parcels/DATAGEN/models/public_service/weather-model.json" -F "rows=1000" -F "threads=10"

It will also directly answer you and tell you if there are any errors with your model, answer for us is:

{ "commandUuid": "61b9757f-78da-4773-9c5d-a3f154f2b524" , "error": "" }

A command UUID is returned and should be used to check the status of the data generation launched using another API, located in command-runner-controller, called /command/getCommandStatus . This API required the command UUID received and will return status as a JSON like this:

{ "commandUuid": "61b9757f-78da-4773-9c5d-a3f154f2b524" , "status": "FINISHED" , "comment": "" , + "progress": "100.0" ,  "duration": "858ms" }

Let’s Verify

In a shell with a logged in user (optionally use datagen ones):

 hdfs dfs -ls /user/datagen/hdfs/publicservice/weather/
 
Found 10 items
-rw-r--r--   3 datagen datagen     756988 2022-10-14 00:44 /user/datagen/hdfs/publicservice/weather/weather-0000000000.json
-rw-r--r--   3 datagen datagen     756448 2022-10-14 00:44 /user/datagen/hdfs/publicservice/weather/weather-0000000001.json
-rw-r--r--   3 datagen datagen     756374 2022-10-14 00:44 /user/datagen/hdfs/publicservice/weather/weather-0000000002.json
-rw-r--r--   3 datagen datagen     756204 2022-10-14 00:44 /user/datagen/hdfs/publicservice/weather/weather-0000000003.json
-rw-r--r--   3 datagen datagen     756878 2022-10-14 00:44 /user/datagen/hdfs/publicservice/weather/weather-0000000004.json
-rw-r--r--   3 datagen datagen     756132 2022-10-14 00:44 /user/datagen/hdfs/publicservice/weather/weather-0000000005.json
-rw-r--r--   3 datagen datagen     756812 2022-10-14 00:44 /user/datagen/hdfs/publicservice/weather/weather-0000000006.json
-rw-r--r--   3 datagen datagen     757160 2022-10-14 00:44 /user/datagen/hdfs/publicservice/weather/weather-0000000007.json
-rw-r--r--   3 datagen datagen     756216 2022-10-14 00:44 /user/datagen/hdfs/publicservice/weather/weather-0000000008.json
-rw-r--r--   3 datagen datagen     756000 2022-10-14 00:44 /user/datagen/hdfs/publicservice/weather/weather-0000000009.json

 hdfs dfs -cat /user/datagen/hdfs/publicservice/weather/weather-0000000000.json

{ "city" : "Beauchamp", "date" : "2017-03-23", "lat" : "49,0139", "long" : "2,19", "wind_provenance_9_am" : "EAST", "wind_force_9_am" : "98", "wind_provenance_9_pm" : "WEST", "wind_force_9_pm" : "5", "pressure_9_am" : "1003", "pressure_9_pm" : "1009", "humidity_9_am" : "25", "humidity_9_pm" : "35", "temperature_9_am" : "30", "temperature_9_pm" : "9", "rain" : "false" }
{ "city" : "La Garnache", "date" : "2016-02-19", "lat" : "46,8906", "long" : "-1,8311", "wind_provenance_9_am" : "WEST", "wind_force_9_am" : "49", "wind_provenance_9_pm" : "NORTH", "wind_force_9_pm" : "75", "pressure_9_am" : "1014", "pressure_9_pm" : "1002", "humidity_9_am" : "30", "humidity_9_pm" : "8", "temperature_9_am" : "5", "temperature_9_pm" : "33", "rain" : "false" }
{ "city" : "Escoublac", "date" : "2018-04-10", "lat" : "47,2858", "long" : "-2,3922", "wind_provenance_9_am" : "NORTH", "wind_force_9_am" : "62", "wind_provenance_9_pm" : "NORTH", "wind_force_9_pm" : "111", "pressure_9_am" : "1019", "pressure_9_pm" : "1013", "humidity_9_am" : "77", "humidity_9_pm" : "56", "temperature_9_am" : "-1", "temperature_9_pm" : "16", "rain" : "true" }
{ "city" : "Anse", "date" : "2019-06-13", "lat" : "45,9356", "long" : "4,7194", "wind_provenance_9_am" : "WEST", "wind_force_9_am" : "96", "wind_provenance_9_pm" : "EAST", "wind_force_9_pm" : "114", "pressure_9_am" : "1005", "pressure_9_pm" : "1009", "humidity_9_am" : "44", "humidity_9_pm" : "46", "temperature_9_am" : "3", "temperature_9_pm" : "21", "rain" : "false" }
{ "city" : "Dammarie-le-Lys", "date" : "2015-07-24", "lat" : "48,5177", "long" : "2,6402", "wind_provenance_9_am" : "EAST", "wind_force_9_am" : "120", "wind_provenance_9_pm" : "WEST", "wind_force_9_pm" : "87", "pressure_9_am" : "1006", "pressure_9_pm" : "1012", "humidity_9_am" : "68", "humidity_9_pm" : "21", "temperature_9_am" : "-9", "temperature_9_pm" : "-1", "rain" : "false" }
{ "city" : "Saint-Remy-les-Chevreuse", "date" : "2021-10-05", "lat" : "48,7058", "long" : "2,0719", "wind_provenance_9_am" : "SOUTH", "wind_force_9_am" : "62", "wind_provenance_9_pm" : "WEST", "wind_force_9_pm" : "71", "pressure_9_am" : "1017", "pressure_9_pm" : "1007", "humidity_9_am" : "70", "humidity_9_pm" : "95", "temperature_9_am" : "33", "temperature_9_pm" : "-7", "rain" : "true" }
{ "city" : "Vouneuil-sous-Biard", "date" : "2020-07-06", "lat" : "46,5731", "long" : "0,2714", "wind_provenance_9_am" : "EAST", "wind_force_9_am" : "11", "wind_provenance_9_pm" : "EAST", "wind_force_9_pm" : "21", "pressure_9_am" : "1019", "pressure_9_pm" : "1017", "humidity_9_am" : "55", "humidity_9_pm" : "20", "temperature_9_am" : "5", "temperature_9_pm" : "23", "rain" : "false" }
{ "city" : "Tourves", "date" : "2016-05-07", "lat" : "43,4081", "long" : "5,9239", "wind_provenance_9_am" : "NORTH", "wind_force_9_am" : "10", "wind_provenance_9_pm" : "WEST", "wind_force_9_pm" : "39", "pressure_9_am" : "1019", "pressure_9_pm" : "1000", "humidity_9_am" : "60", "humidity_9_pm" : "93", "temperature_9_am" : "0", "temperature_9_pm" : "29", "rain" : "false" }

It’s over for basic data generation, now you can move on to create your own custom model or if you prefer, you can go and play with APIs