Building Data Pipelines in Azure Cloud
How data pipelines are used in Azure to analyze machine data
Fresenius Medical Care
Greater flexibility through the modular construction of the cloud architecture
Faster release cycles
Modern Secrets Management
In a Nutshell:
- Sector: Healthcare
- Task: Make machine data about treatments available for analytics and machine learning.
- About 8 people in total, mainly Data Scientists
- Engineers who develop the machines and use the insights
- Project duration: more than 12 months
- Dialysis machines collect data locally in the clinic, which makes analyzing that data difficult
- Complex parsers are needed to decrypt machine data
- Build Azure data pipelines to consolidate important data
- Merging of different data sources so that further KPIs can be calculated
- Data pipelines should be idempotent and modular
- Scaling: Pipelines should be designed for increasing data volumes
- Privacy: patient data must be handled with the utmost care
- Using Docker to containerize the parsers and run them in Azure Functions
- Upload data from various sources to Azure Data Lake Gen2 (Parquet)
- Using Azure Synapse Analytics to query the data with SQL
- Auto-scaling through the use of Azure Functions
- Modular, simple and scalable pipelines in Azure Data Factory
- Pipeline orchestration through Azure Data Factory
- Structure of different data layers in the Azure Data Lake (Raw, Staging, Core, Presentation)
- Fully automated provisioning of the complete infrastructure with Infrastructure as Code (Pulumi)
- Greater flexibility due to the modular structure of the cloud architecture
- Scalability through Azure Functions and auto-scaling in the cloud
- Cost-effective storage of data in Azure Data Lake Gen2
- Robustness through the use of Infrastructure as Code (Pulumi)
- Faster release cycles through the use of CI/CD and GitHub Actions pipelines
- Modern secrets management through the use of Key Vaults and Infrastructure as Code
- Faster analysis through automated provisioning of data
- Secure and encrypted data storage in the cloud
The goal of the project was to collect data from various local machines, transfer it to the cloud and analyze it there. For this purpose, Azure Data Lake, Azure Data Factory, Azure Synapse and custom Python code were used. The customer placed particular importance on data privacy and on secure data pipelines. A further challenge was the complex data formats produced by the machines, for which custom Python code was written and a parser was developed. The customer also required the Azure infrastructure to be scalable so that it is prepared for growing data volumes.
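The machine data format is not described in this case study, so the following is only a minimal sketch of what parsing code in Python could look like for a binary machine log. The record layout (timestamp, machine ID, one sensor reading) and all names are hypothetical assumptions for illustration, not the customer's actual format.

```python
import struct
from typing import Iterator

# Hypothetical fixed-width record, assumed for illustration only:
# little-endian uint32 timestamp, uint16 machine id, float32 sensor reading.
RECORD = struct.Struct("<IHf")


def parse_records(blob: bytes) -> Iterator[dict]:
    """Yield one dict per record; reject input that ends in a partial record."""
    if len(blob) % RECORD.size != 0:
        raise ValueError("truncated record at end of input")
    for timestamp, machine_id, reading in RECORD.iter_unpack(blob):
        yield {"timestamp": timestamp, "machine_id": machine_id, "reading": reading}
```

Rows produced this way could then be written to a columnar format such as Parquet before being uploaded to the data lake.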
Adapting and building the pipelines
The pipelines were developed primarily in Azure Data Factory. Azure Data Lake Gen2 served as the data source and was regularly extended with new data from the local machines. Various Azure Functions were used to read the data with a custom C parser and process it further. Data scientists then took over further processing, working primarily in Azure Databricks. Azure Synapse was used to query all the data, accessing both the data in the data lake and the data in Azure Databricks. Building separate release and deployment pipelines significantly reduced the average delivery time for new features. The pipelines were modularized so that different sub-teams could use them individually, depending on their specific application.
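The case study names idempotency and a layered data lake (Raw, Staging, Core, Presentation) as design goals but does not show how they were achieved. One common approach, sketched here as an assumption rather than the project's actual implementation, is to derive a deterministic output path from the input and overwrite it on every run. A local directory stands in for the data lake, JSON stands in for Parquet, and the function name `promote` is hypothetical.

```python
import json
from pathlib import Path

LAYERS = ("raw", "staging", "core", "presentation")


def promote(lake: Path, source_layer: str, name: str, transform) -> Path:
    """Read lake/<source_layer>/<name>, transform it, write it to the next layer.

    The target path depends only on the inputs and the file is replaced as a
    whole, so re-running the step gives the same result and no duplicates.
    """
    target_layer = LAYERS[LAYERS.index(source_layer) + 1]
    rows = json.loads((lake / source_layer / name).read_text())
    target = lake / target_layer / name
    target.parent.mkdir(parents=True, exist_ok=True)
    tmp = target.with_suffix(".tmp")
    tmp.write_text(json.dumps(transform(rows), sort_keys=True))
    tmp.replace(target)  # swap the finished file into place
    return target
```

Because each step only knows its source layer and a transform, steps can be chained or reused by different sub-teams without sharing state.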
Robust, modular cloud infrastructure and data security
Another important requirement was to make the cloud infrastructure scalable. Infrastructure as Code provided a clearly defined, central overview of the cloud resources and infrastructure. It also improved developer workflows and the integration with various DevOps processes. The entire infrastructure was built to scale with growing data volumes. Storing the data in Azure Data Lake and transferring it securely ensured the security of the customer's data at all times. Data from the USA is always hosted in US regions and data from Europe in European regions.
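A residency rule like the one in the last sentence can be enforced in code rather than by convention. The sketch below assumes a hypothetical mapping from a record's country of origin to an Azure region. `eastus` and `westeurope` are real Azure region names, but which regions the project actually used is not stated in the source.

```python
# Hypothetical mapping; the regions actually used in the project are not stated.
REGION_BY_ORIGIN = {
    "US": "eastus",
    "DE": "westeurope",
    "FR": "westeurope",
}


def region_for(origin_country: str) -> str:
    """Return the Azure region in which data from this country must be stored.

    Unknown origins raise instead of falling back to a default, so data is
    never placed silently in the wrong jurisdiction.
    """
    try:
        return REGION_BY_ORIGIN[origin_country.upper()]
    except KeyError:
        raise ValueError(f"no residency rule for origin {origin_country!r}") from None
```

In an Infrastructure as Code program, the result of such a function could be used as the location of each resource group and storage account.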
Project status and results
The customer is very happy with the choice of Azure as the cloud infrastructure. The project is still running and scales well with the increasing data volumes. Optimizing various DevOps processes and building the pipelines made all sub-teams significantly faster at testing, releasing and deploying the software. As a result, internal departments saw results sooner and could adapt to customer requirements within the sprints. Infrastructure as Code made it possible to create a robust and modular cloud infrastructure. Because the pipelines are dynamic, different in-house teams at the customer can access the analyzed data and build further analyses.
- Azure Synapse Analytics
- Azure Blob Storage
- Azure Data Factory
CI/CD & IaC:
- GitHub Actions
- Azure Key Vault
- Azure Resource Groups
- Infrastructure as Code – Pulumi