Tutorial: Creating and Running a JSON Cleaning Workflow
Objective
This workflow will read a JSON file, remove any fields with null
values, and save the cleaned JSON data to a new file.
Step 1: Create the Python Steps
- Create a Python file to define the steps. In this example, we'll name it steps.py.
- Add the necessary imports. steps provides the register decorator used to register the step classes, and Step is the base class for all steps. The step code below also uses the standard json module.
- Implement the LoadJSONStep to load the JSON file and store it in the workflow context.

    @steps.register(name='load-json')
    class LoadJSONStep(Step):
        """
        Step that loads JSON data from a file and stores it in the context.
        """

        def perform(self, input_json: str) -> None:
            with open(input_json, 'r') as infile:
                data = json.load(infile)
            self.workflow_context.set("json_data", data)
            self.logger.info(f'JSON data loaded from {input_json}')

- Implement the RemoveNullsStep to recursively remove null fields from the JSON data stored in the context.

    @steps.register(name='remove-nulls')
    class RemoveNullsStep(Step):
        """
        Step that removes all fields with null values from JSON data in the context.
        """

        def remove_nulls(self, data):
            if isinstance(data, dict):
                return {key: self.remove_nulls(value) for key, value in data.items() if value is not None}
            elif isinstance(data, list):
                return [self.remove_nulls(item) for item in data]
            else:
                return data

        def perform(self) -> None:
            data = self.workflow_context.get("json_data")
            cleaned_data = self.remove_nulls(data)
            self.workflow_context.set("cleaned_json_data", cleaned_data)
            self.logger.info('Null fields removed from JSON data')

- Implement the SaveJSONStep to save the cleaned JSON data back to a file.

    @steps.register(name='save-json')
    class SaveJSONStep(Step):
        """
        Step that saves JSON data from the context to a file.
        """

        def perform(self, output_json: str) -> None:
            data = self.workflow_context.get("cleaned_json_data")
            with open(output_json, 'w') as outfile:
                json.dump(data, outfile, indent=4)
            self.logger.info(f'Cleaned JSON data saved to {output_json}')

- Save steps.py in a known location.
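
Before wiring the steps into workflows-manager, it can help to dry-run the same logic without the framework. The sketch below is not part of the library: it assumes a plain dict is an acceptable stand-in for the workflow context, and the names remove_nulls and run_pipeline are hypothetical. It also makes one detail of the cleaning logic visible: only dictionary fields are filtered, so a null that sits directly inside a list is kept.

```python
import json
import tempfile
from pathlib import Path


def remove_nulls(data):
    """Recursively drop dict keys whose value is None (same logic as RemoveNullsStep)."""
    if isinstance(data, dict):
        return {key: remove_nulls(value) for key, value in data.items() if value is not None}
    if isinstance(data, list):
        # List items are cleaned recursively, but None items themselves are kept.
        return [remove_nulls(item) for item in data]
    return data


def run_pipeline(input_json: str, output_json: str) -> dict:
    """Mimic load-json -> remove-nulls -> save-json, sharing data through a dict."""
    context = {}  # hypothetical stand-in for self.workflow_context

    # load-json
    with open(input_json, 'r') as infile:
        context["json_data"] = json.load(infile)

    # remove-nulls
    context["cleaned_json_data"] = remove_nulls(context["json_data"])

    # save-json
    with open(output_json, 'w') as outfile:
        json.dump(context["cleaned_json_data"], outfile, indent=4)
    return context


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "input.json"
        target = Path(tmp) / "output.json"
        source.write_text(json.dumps({"a": 1, "b": None, "c": [{"d": None, "e": 2}, None]}))
        run_pipeline(str(source), str(target))
        print(target.read_text())
```

If list items that are null should also be dropped, the list branch would need its own filter; the tutorial's step does not do this.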
Step 2: Create the Workflow Configuration File
- Create a YAML or JSON file to configure the workflow. Name this file workflows.yaml or workflows.json.
- Define the parameters:
  - input_json: Path to the input JSON file.
  - output_json: Path where the cleaned JSON will be saved.
- Define the workflow to include each step in the correct sequence.

  workflows.json

    {
      "parameters": [
        {
          "name": "input_json",
          "value": "<project_path>/data/input.json"
        },
        {
          "name": "output_json",
          "value": "<project_path>/data/output.json"
        }
      ],
      "workflows": {
        "clean-json": {
          "steps": [
            {
              "name": "Load JSON",
              "step": "load-json"
            },
            {
              "name": "Remove null values",
              "step": "remove-nulls"
            },
            {
              "name": "Save JSON",
              "step": "save-json"
            }
          ]
        }
      }
    }

  Replace <project_path> with the actual path to your project directory.
- Save workflows.yaml or workflows.json in the same directory as steps.py or a known path.
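
Rather than editing <project_path> by hand, the configuration can be generated. This is a minimal sketch that only reproduces the JSON structure shown in this step; build_config is a hypothetical helper, not part of workflows-manager, and it assumes the data files live under a data directory inside the project.

```python
import json
from pathlib import Path


def build_config(project_path: Path) -> dict:
    """Build the clean-json workflow configuration for a given project directory."""
    data_dir = project_path / "data"  # assumption: same layout as in this tutorial
    return {
        "parameters": [
            {"name": "input_json", "value": str(data_dir / "input.json")},
            {"name": "output_json", "value": str(data_dir / "output.json")},
        ],
        "workflows": {
            "clean-json": {
                "steps": [
                    {"name": "Load JSON", "step": "load-json"},
                    {"name": "Remove null values", "step": "remove-nulls"},
                    {"name": "Save JSON", "step": "save-json"},
                ]
            }
        },
    }


if __name__ == "__main__":
    # Assumption: the script is run from the project directory.
    print(json.dumps(build_config(Path.cwd()), indent=2))
```

Redirecting the printed output to workflows.json gives a file with absolute paths already filled in.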
Step 3: Prepare a Sample JSON File
Create a sample JSON file with some fields set to null. Save it as input.json in the specified location (<project_path>/data/input.json).
Example input.json content:
{
  "name": "Alice",
  "age": 30,
  "address": {
    "street": "123 Main St",
    "city": null,
    "state": "CA",
    "postal_code": null
  },
  "contacts": [
    {"type": "email", "value": "alice@example.com"},
    {"type": "phone", "value": null}
  ],
  "preferences": {
    "newsletter": true,
    "sms_alerts": null
  },
  "membership": null
}
Step 4: Run the Workflow
- Open a terminal and navigate to the directory where you saved steps.py and workflows.yaml or workflows.json.
- Run the workflow with workflows-manager, specifying the imports and configuration file paths.
  - If you'd like more detailed logging output, add --log-level debug.
  - If you want to save logs to a file, use --log-file /path/to/logfile.log.

  Override parameters from the command line

  You can override the parameters defined in the configuration file by passing them on the command line.
- Verify the output:
  - After the workflow completes, check <project_path>/data/output.json.
  - The cleaned JSON file should have all null fields removed.
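
The verification can also be scripted. The sketch below is an illustration only: find_null_fields and check_output are hypothetical helpers, and EXPECTED is the result of applying the RemoveNullsStep logic by hand to the sample input.json from Step 3. Note that the phone contact keeps its type field and loses only its null value field.

```python
import json


def find_null_fields(data, path="$"):
    """Return the paths of all dict fields whose value is None."""
    found = []
    if isinstance(data, dict):
        for key, value in data.items():
            child = f"{path}.{key}"
            if value is None:
                found.append(child)
            else:
                found.extend(find_null_fields(value, child))
    elif isinstance(data, list):
        for index, item in enumerate(data):
            found.extend(find_null_fields(item, f"{path}[{index}]"))
    return found


# Expected cleaned data for the sample input.json from Step 3.
EXPECTED = {
    "name": "Alice",
    "age": 30,
    "address": {"street": "123 Main St", "state": "CA"},
    "contacts": [
        {"type": "email", "value": "alice@example.com"},
        {"type": "phone"},
    ],
    "preferences": {"newsletter": True},
}


def check_output(output_json: str) -> bool:
    """True if the file has no null fields and matches the expected sample result."""
    with open(output_json, 'r') as infile:
        data = json.load(infile)
    return not find_null_fields(data) and data == EXPECTED
```

Calling check_output with the path to output.json returns True when the workflow produced the expected file for the sample input; for other inputs, find_null_fields alone reports any remaining null fields by path.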
Summary
You’ve now created a simple JSON cleaning workflow using workflows-manager! Here’s a recap of the steps:
- Define Python steps for loading, cleaning, and saving JSON.
- Configure the workflow in either a workflows.yaml or workflows.json file.
- Prepare an input JSON file with null fields.
- Run the workflow from the command line to generate the cleaned JSON file.
This setup is easily extensible for more complex data processing workflows.