Internet Of Plants

Part 1 - Data Provisioning

Intro

Depending on the size of your organisation, many different data roles are covered by different personas. In a small to medium business there is typically that one IT person everybody goes to for Excel questions, updating the SharePoint, configuring network security, or connecting to SQL databases. When organisations grow, the complexity of their IT (and OT) infrastructure grows with it. A data scientist needs to deliver actionable insights from the available data to move the business forward. The first thing to do before starting a data science project is exploring whether the data is of sufficient quality to get the job done.

As a curious data scientist I want to broaden my knowledge about how to provision new time-series data and better understand its challenges. I decided to buy some hardware and configure the necessary cloud infrastructure to stream the data from my devices to the cloud in (near) real-time. How hard could it be!? Haha, well … spoiler alert, it was harder than I expected. In the next few sections I will share what I did and the lessons I learned. For those interested, the code repository is also shared on my GitHub.

Problem Statement

First, I needed a problem to solve. At home, I don't have that many interesting production processes that need to be optimised. Nevertheless, there are some candidates that require my attention. I picked one of my lovely indoor plants, Pilea peperomioides, aka the pancake plant, as my research object.

Experimental set-up to monitor CO₂ fixation in Pilea peperomioides

Let’s say I want to monitor how well my specimen grows under certain conditions. My key performance indicator (KPI) or target variable is the rate of CO₂ fixation, where maximum plant growth = maximum plant happiness = maximum $-\frac{\partial CO_2}{\partial t}$.
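
To make the KPI concrete, here is a minimal Python sketch of how that derivative could be estimated from raw sensor readings with finite differences. The column names (timestamp, co2_ppm) are assumptions for illustration, not the exact schema of my pipeline.

```python
import pandas as pd

# Minimal sketch: estimate the CO2 fixation rate (-dCO2/dt) with finite
# differences. Column names are assumptions, not the project's exact schema.
def co2_fixation_rate(df: pd.DataFrame) -> pd.Series:
    df = df.sort_values("timestamp")
    dt_hours = df["timestamp"].diff().dt.total_seconds() / 3600.0
    dco2 = df["co2_ppm"].diff()
    # Positive values mean CO2 is being removed from the air (net fixation).
    return -(dco2 / dt_hours)

# Dummy readings taken 10 minutes apart
readings = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-05-01 12:00", "2024-05-01 12:10", "2024-05-01 12:20"]),
    "co2_ppm": [620.0, 610.0, 603.0],
})
print(co2_fixation_rate(readings))  # ppm per hour: NaN, 60.0, 42.0
```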

With my bio-engineering background I know that autotrophic organisms like plants use sunlight to convert CO₂ into sugars through photosynthesis (the Calvin cycle for C3 plants). However, when darkness falls, our leafy friends don't simply die. Instead, they burn those hard-earned sugars using oxygen, converting them back into CO₂ through dark respiration (glycolysis and the Krebs cycle). I won’t go into much more detail because that would take us too far. Also, the purpose of this experiment is collecting data, not so much doing research about plants.

Hypotheses

OK, that being said, let’s formulate the hypotheses we want to test (a sketch of how they could be checked later follows the list):

  • CO₂ concentration should differ between night and day.

  • Reactions should go faster at higher temperatures.
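
The real analysis is for the follow-up post, but to make the hypotheses tangible, here is a hedged Python sketch of how they could be checked once the data has landed. The dataframe layout (timestamp, co2_ppm, temperature_c) and the 08:00–20:00 daytime window are assumptions.

```python
import pandas as pd
from scipy import stats

# Hedged sketch of how the two hypotheses could be tested later on.
# Column names and the 08:00-20:00 "daytime" window are assumptions.
def test_hypotheses(df: pd.DataFrame) -> None:
    df = df.sort_values("timestamp").copy()
    dt_hours = df["timestamp"].diff().dt.total_seconds() / 3600.0
    df["dco2_per_h"] = df["co2_ppm"].diff() / dt_hours
    is_day = df["timestamp"].dt.hour.between(8, 20)

    # Hypothesis 1: CO2 concentration differs between day and night.
    t_stat, p_val = stats.ttest_ind(df.loc[is_day, "co2_ppm"],
                                    df.loc[~is_day, "co2_ppm"],
                                    equal_var=False)
    print(f"day vs night CO2: t={t_stat:.2f}, p={p_val:.3f}")

    # Hypothesis 2: dark respiration speeds up with temperature,
    # so CO2 should rise faster on warmer nights.
    night = df.loc[~is_day, ["temperature_c", "dco2_per_h"]].dropna()
    r, p_val = stats.pearsonr(night["temperature_c"], night["dco2_per_h"])
    print(f"temperature vs nightly CO2 increase: r={r:.2f}, p={p_val:.3f}")
```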

Experimental set-up

In contrast to industrial automation systems, which connect sensors through robust PLCs or DCSs designed for high-speed data transfer, a microcontroller setup offers a more flexible and cost-effective approach. For less than 30 euros I bought myself the following hardware at Amazon to get me started.

After wiring up the different sensors, each data point needs to be sent to a messaging broker that connects to cloud storage. Now the complexity kicks in. Different communication protocols are used to ensure the data is readily available in the cloud, with specific requirements at each stage of the data pipeline. I2C is used to connect the sensors to the microcontroller, because it can manage multiple devices on a single bus. Once data is collected on the controller, the MQTT protocol is employed for efficient message delivery to the cloud via my Wi-Fi router, where low bandwidth usage and latency are crucial. Next, a built-in Kafka connector is used to handle real-time data streams with high throughput and fault tolerance. Finally, the data is stored in Parquet format for its efficiency in managing large datasets, offering effective compression and faster querying capabilities. Each of these protocols and formats plays a vital role in ensuring smooth data flow from field to cloud, tailored to the specific needs of the application.
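
To make the hand-over points tangible, the sketch below shows what a single MQTT payload could look like and how a batch of such messages ends up as a columnar Parquet file. The field names, device name, and file path are illustrative assumptions, not my exact schema.

```python
import json
import pandas as pd

# Illustrative message as it travels over MQTT; field names are assumptions.
mqtt_payload = json.dumps({
    "device_id": "pilea-monitor-01",          # hypothetical device name
    "timestamp": "2024-05-01T12:00:00Z",
    "temperature_c": 22.4,
    "humidity_pct": 48.0,
    "co2_ppm": 615,
})

# Downstream, batches of these JSON messages land as columnar Parquet files,
# which compress well and are quick to filter on, e.g., the timestamp column.
batch = pd.DataFrame([json.loads(mqtt_payload)])
batch.to_parquet("plant_metrics.parquet", index=False)  # needs pyarrow or fastparquet
```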

Architecture overview for MQTT and Kafka streaming on the EMQX platform.

I used the following software to build the required data pipelines, for about 3 euros/day:

  • Open-source PlatformIO (VS Code extension) to push the script from my VS Code IDE to my controller

  • Free license EMQX Cloud MQTT broker

  • (Pay as you go) Azure Event Hubs and Stream Analytics in the Azure Portal

  • (Pay as you go) Databricks

Initially the cost was a lot higher, mainly for storage, because of some default settings. Special thanks to my friend, Jorn Beyers, who gave me advice on setting up secrets for Databricks and helped me out with configurations in the Azure Portal. I might write a separate post about how the cost could be optimised, because I think it’s worth diving into a bit deeper.

Results

For me the most difficult part was debugging the code without any prior experience in these steps. I won’t go through the entire process, but I’ll briefly explain its key features in pseudocode (a Python sketch of the same loop follows the list).

  • Connect to Wi-Fi

  • Connect to MQTT broker

  • Get synchronised UTC timestamps from an external NTP server

  • Read out sensor values (temperature, humidity and CO₂) and send a JSON string as payload to EMQX
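
For readability, here is that loop sketched in Python with paho-mqtt rather than the actual C++ firmware that PlatformIO flashes to the controller. The broker host, topic, credentials, and the read_sensors() helper are placeholders.

```python
import json
import time
from datetime import datetime, timezone

import paho.mqtt.client as mqtt

BROKER_HOST = "your-deployment.emqxsl.com"   # placeholder EMQX Cloud endpoint
TOPIC = "plants/pilea/metrics"               # placeholder topic

def read_sensors() -> dict:
    # Placeholder for the I2C read-out (temperature, humidity, CO2).
    return {"temperature_c": 22.4, "humidity_pct": 48.0, "co2_ppm": 615}

client = mqtt.Client()                        # paho-mqtt 1.x style constructor
client.tls_set()                              # EMQX Cloud expects TLS
client.username_pw_set("user", "password")    # placeholder credentials
client.connect(BROKER_HOST, 8883)
client.loop_start()

while True:
    payload = read_sensors()
    # On the microcontroller the clock is synchronised over NTP;
    # here we simply take the current UTC time.
    payload["timestamp"] = datetime.now(timezone.utc).isoformat()
    client.publish(TOPIC, json.dumps(payload), qos=1)
    time.sleep(60)                            # one reading per minute
```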

I like EMQX as a messaging broker since it offers a lot of features out of the box and comes with a 1 GB free monthly quota. The cloud interface and documentation were also very user-friendly. Once the data is in there, I simply forward the JSON message to my Azure instance and do further data processing there. Alternatively, you could write SQL queries in EMQX for transformations, but I found it much easier to configure the data format right at its source.

Azure Event Hubs and Stream Analytics offer powerful capabilities for handling real-time data streams. These services allow you to ingest and process data at various speeds, from real-time to batch, depending on your specific needs. In my setup, I partitioned the data into hourly chunks, keeping a balance between near real-time processing and resource efficiency. This approach helped me optimise storage costs and query performance while reducing system load.

Azure Stream Analytics configuration in the no-code editor

Finally, my data is ready for consumption and the fun part can begin! The data can be pulled into Databricks using pandas or Spark to build streaming data pipelines with Delta Live Tables, create machine learning models, deploy apps, and understand under which conditions my pancake plant is the happiest.
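
As a first taste, here is a hedged PySpark sketch of what that consumption step might look like in a Databricks notebook (where spark is predefined); the storage path and column names are assumptions.

```python
from pyspark.sql import functions as F

# Hedged sketch: read the landed Parquet files and aggregate per hour of day.
# The abfss path and column names are assumptions for illustration.
df = (
    spark.read.parquet("abfss://plant-data@<storageaccount>.dfs.core.windows.net/metrics/")
    .withColumn("ts", F.to_timestamp("timestamp"))
    .withColumn("hour", F.hour("ts"))
)

(
    df.groupBy("hour")
      .agg(F.avg("co2_ppm").alias("avg_co2_ppm"),
           F.avg("temperature_c").alias("avg_temp_c"))
      .orderBy("hour")
      .show(24)
)
```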

In my next blog post I’ll explain how I would analyse this data in more depth and share my insights from this DIY plant monitoring experiment. I’ll offer recommendations to optimise the target function, test my formulated hypotheses, and discuss the limitations of my current set-up.

Thanks a lot for sticking with me to the end! I’m eager to hear your thoughts. Drop me a comment on LinkedIn to discuss how data science can transform your business challenges into competitive advantages.
