

Why Modern IT Teams Need a Data Lake to Manage SaaS Sprawl

Learn how a data lake for modern IT teams helps unify scattered SaaS data into one source of truth. From spreadsheets to in-house builds, see how IT can choose the right approach to gain visibility and improve security.

Modified on Sep 14, 2025 | 11 minutes

"Data lake for IT teams? That's unheard of!"

This is a common reaction when the concept of a data lake is brought up in the context of IT teams.

Many assume IT teams only manage systems and infrastructure, not massive, complex datasets that require something like a data lake. But that assumption no longer holds. The rise of SaaS in organizations has created data from multiple sources: user activity, access logs, operational metrics, and more.

The challenge isn’t just storing this data. It’s understanding it, extracting actionable insights, and using it to secure, optimize, and govern your SaaS environment.

TL;DR

  • The explosion of SaaS apps has created massive, scattered datasets—making IT teams’ jobs less about storage and more about extracting insights for security, optimization, and governance.
  • A data lake unifies these diverse data streams, giving IT teams a single view of user activity, access, and app usage that siloed tools can’t provide.
  • Spreadsheet-based “data lakes” are quick and cheap to set up but struggle with scale, complexity, and data accuracy.
  • In-house data lakes offer full control and advanced analytics but require major investment in infrastructure, engineering, and ongoing maintenance.
  • Choosing the right approach depends on factors like SaaS complexity, data volume, compliance needs, available IT resources, and budget.
  • Tools like Stitchflow give IT teams a ready-made, centralized view of all SaaS apps—tracking usage, monitoring access, and providing actionable insights—without the time and cost of building and maintaining a data lake.

But what exactly is a data lake?

Technically, a data lake is “a system or repository of data stored in its raw format.” That’s the Wikipedia definition. Useful, but it misses the bigger picture—especially for organizations running multiple SaaS applications.

Data lakes in the SaaS-first IT environment are more than storage

In the context of SaaS applications, a data lake acts as a single, unified repository for all the data these tools generate. Each app contributes a different type of data, like streams feeding into a larger lake. 

Each app contributes a unique “data signature”

  • Google Groups: Shows which groups a user belongs to
  • Slack: Reveals which channels users participate in
  • Device Management Tools: Highlight outdated software on devices

A data lake brings together these diverse streams, breaking down silos that normally keep information trapped in individual apps. This unified view of activity and interactions across your organization gives IT teams the insights they need to make smarter, faster decisions.

Why should IT teams care about data lakes?

Modern IT environments are drowning in data from dozens (or even hundreds) of SaaS applications. Data lakes offer more than just consolidation—they give IT teams strategic control, turning scattered data streams into a cohesive intelligence platform that supports proactive, informed decision-making.

Centralized and consistent data management

A data lake brings all your data into one place, but its value goes beyond just storage. By maintaining data lineage, IT teams can trace information back to its source, verify accuracy, and eliminate the time-consuming process of reconciling conflicting reports. The result? Decisions based on reliable, authoritative data—and less firefighting when numbers don’t add up.
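In practice, data lineage can be as simple as tagging every record with where and when it was pulled. A minimal Python sketch of the idea (the wrapper function and field names are illustrative, not a fixed standard):

```python
from datetime import datetime, timezone

def with_lineage(record: dict, source: str) -> dict:
    """Wrap a raw record with lineage metadata so any number in a
    report can be traced back to the app and moment it came from."""
    return {
        "data": record,
        "lineage": {
            "source": source,  # e.g. "slack", "okta", "jamf"
            "fetched_at": datetime.now(timezone.utc).isoformat(),
        },
    }

row = with_lineage({"user": "ana@example.com", "active": True}, source="slack")
print(row["lineage"]["source"])  # slack
```

When two reports disagree, the `lineage` block tells you which source system each figure came from and how fresh it was, instead of leaving you to reconcile by hand.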

Scalability and flexibility

Unlike traditional warehouses that need rigid schemas, data lakes use a schema-on-read approach, letting IT teams adapt as SaaS ecosystems evolve. Adding new tools—whether project management platforms, security solutions, or emerging apps—is straightforward. And it’s not just about structured data: logs, unstructured files, and real-time streams all fit well. The architecture grows with the organization, not against it.
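Schema-on-read means raw records land untouched and each query applies its own structure afterwards. A toy Python sketch of the idea, using two differently shaped events (the field names are invented for illustration):

```python
import json

# Raw events land as-is; a schema-on-write warehouse would force
# both shapes into one table up front.
raw_events = [
    '{"app": "slack", "user": "ana", "channel": "#it"}',
    '{"app": "jamf", "device": "MBP-042", "os_version": "13.2"}',
]

def read_with_schema(lines, fields):
    """Apply a schema only at read time: pick the fields this query
    cares about, tolerating records that lack them."""
    for line in lines:
        event = json.loads(line)
        yield {f: event.get(f) for f in fields}

# Two different "schemas" over the same raw data, defined per query.
print(list(read_with_schema(raw_events, ["app", "user"])))
# [{'app': 'slack', 'user': 'ana'}, {'app': 'jamf', 'user': None}]
```

Adding a new SaaS tool just means landing its events; no existing table or query has to change.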

Better analytics

With consolidated data, IT teams can uncover insights that would be impossible in siloed systems. For example, you can see how application performance impacts user satisfaction or spot patterns linking security incidents to specific workflows. 

Layer in advanced analytics and machine learning, and your data lake can predict failures, flag anomalies, and optimize resource allocation across your SaaS environment.
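As a toy stand-in for that kind of anomaly detection, here is a sketch that flags days whose login counts deviate sharply from the norm (a real deployment would use proper models, not a z-score over a week of data):

```python
from statistics import mean, stdev

def flag_anomalies(daily_logins, threshold=2.0):
    """Flag days whose login count deviates from the mean by more
    than `threshold` standard deviations."""
    mu, sigma = mean(daily_logins), stdev(daily_logins)
    return [i for i, n in enumerate(daily_logins)
            if sigma and abs(n - mu) / sigma > threshold]

logins = [102, 98, 110, 95, 105, 480, 101]  # day 5 is a spike
print(flag_anomalies(logins))  # [5]
```

The same pattern scales up: once activity from every app sits in one place, a single detector can watch for spikes that no per-app dashboard would catch.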

Strategic resource optimization

Data lakes cut costs by eliminating redundant storage, but the real win is time. Instead of spending hours aggregating data for reports, IT teams can focus on automation, security improvements, and digital transformation projects. 

Over time, the savings compound: fewer duplicate licenses, more consistent metrics, and a streamlined operational footprint.

👉 Order a free on-demand audit service from Stitchflow—like a license cleanup, access review, or offboarding sweep—and get a glimpse of how much time and waste you can eliminate with a streamlined SaaS stack.

Security and governance

Centralized data management turns security from a reactive patchwork into a proactive strategy. IT teams can enforce uniform access controls, maintain audit trails, and ensure compliance across all SaaS tools. 

And because data lakes are flexible, they adapt to new analytics tools, evolving regulations, and changing business needs, ensuring your data architecture is resilient and avoids becoming tomorrow’s technical debt.

📚 Also read: Why is this single pane of glass for IT teams so damn hard to build?

How can IT teams build a data lake?

Creating a data lake for IT teams, particularly for merging data from different SaaS applications, can vary in approach based on the available tools and resources.

The complexity of the solution can range from simple, easily accessible methods like using Google Sheets to more sophisticated, resource-intensive strategies like developing an in-house solution.

The complexity of building an in-house data lake vs. a spreadsheet-based data lake

Low complexity: Setting up a spreadsheet system

For smaller IT environments or teams just starting to centralize SaaS data, a spreadsheet-based approach using Google Sheets or Excel can act as a simple “data lake.” This method relies on basic functions to import, process, and analyze data from various sources. 

How to create a spreadsheet-based IT data lake

  • Identify your data sources: Determine which SaaS apps to pull from, such as Google Workspace, Slack, Kandji, or Jamf
  • Set up your spreadsheet: Create a new document and organize it with separate sheets for different data types or sources
  • Automate data imports: Use functions like IMPORTDATA, IMPORTRANGE, IMPORTHTML, or IMPORTXML, or leverage Google Apps Script for API connections
  • Clean and format data: Standardize dates and text entries, and remove duplicates to ensure consistency
  • Integrate data: Combine information from multiple sheets using functions like VLOOKUP or QUERY to align data by common identifiers
  • Schedule updates: Refresh your imports regularly to keep your “data lake” current
  • Implement basic security: Control access with spreadsheet sharing settings to protect sensitive data
  • Analyze your data: Use pivot tables, charts, or connections to BI tools for deeper insights
  • Maintain the data lake: Regularly check imports, update scripts, and perform cleanup to keep the repository accurate and efficient
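To illustrate the cleaning step, here is a small Python sketch that normalizes dates, lower-cases emails, and drops duplicates before the data lands in a sheet. The export format and field names are assumptions about what a typical SaaS CSV looks like:

```python
from datetime import datetime

def clean_rows(rows):
    """Normalize dates to ISO format, lower-case emails, and drop
    duplicate users so VLOOKUP/QUERY joins line up later."""
    seen, cleaned = set(), []
    for row in rows:
        email = row["email"].strip().lower()
        if email in seen:  # same user exported twice
            continue
        seen.add(email)
        row["email"] = email
        # Assumes US-style MM/DD/YYYY exports; adjust per source app.
        row["last_login"] = datetime.strptime(
            row["last_login"], "%m/%d/%Y").date().isoformat()
        cleaned.append(row)
    return cleaned

rows = [
    {"email": "Ana@Example.com", "last_login": "03/01/2025"},
    {"email": "ana@example.com", "last_login": "03/01/2025"},  # duplicate
]
print(clean_rows(rows))
```

The same logic could live in Google Apps Script; the point is that casing and date-format drift between apps will silently break spreadsheet joins unless something normalizes it.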

While the manual, spreadsheet approach is low complexity and requires minimal technical overhead, it works best for smaller-scale environments. For larger organizations or those with high-volume, diverse SaaS data, more robust solutions are typically necessary.

Pros

  • Google Sheets is user-friendly and widely accessible, requiring no special training
  • Provides a cost-efficient way to start consolidating data without additional investment
  • Quick to set up, allowing teams to start integrating data with minimal delay

Cons

  • Struggles to handle large volumes of data or complex data types
  • Requires considerable manual input, which can lead to errors and inefficiencies
  • Offers limited capabilities for advanced data analysis and processing

High complexity: Developing an in-house solution

For larger organizations or those with unique requirements, building an in-house data lake can provide a fully customized solution—but it’s a complex, resource-intensive approach. It typically requires a team of developers and data engineers to design, implement, and maintain the system.

How to build an in-house IT data lake

  • Define requirements and scope: Identify what the data lake should achieve, what types of data it will store, and how the data will be used. This step sets the direction for the entire project
  • Choose a technology stack: Decide on platforms like Hadoop, Amazon S3, Microsoft Azure, or Google Cloud, considering scalability, performance, cost, and compatibility
  • Design the architecture: Plan how data will be stored, processed, and accessed, including storage formats, database types, and overall infrastructure design
  • Set up storage infrastructure: Implement physical or cloud-based storage that is both scalable and secure
  • Develop data ingestion processes: Create pipelines to pull data from SaaS apps and other sources via APIs, webhooks, or ETL workflows
  • Implement data processing and transformation tools: Use tools like Apache Spark or Apache Flink to process and transform data as needed
  • Ensure data quality and consistency: Apply cleaning, deduplication, and validation to maintain reliable, accurate data
  • Set up governance and compliance: Establish policies to ensure security and regulatory compliance (e.g., GDPR, HIPAA)
  • Integrate analytics and BI tools: Connect the lake to BI platforms to generate insights, reports, and visualizations
  • Implement robust security measures: Use encryption, access controls, and network security protocols to safeguard the data
  • Test and optimize: Validate functionality, performance, and cost-efficiency, making adjustments as needed
  • Train your team: Ensure staff are confident in using and maintaining the system
  • Deploy and monitor: Roll out the data lake and continuously track performance, usage, and data integrity

While this approach offers full control and customization, it comes with higher costs, longer timelines, and ongoing maintenance demands—making it best suited for organizations with specialized data requirements and dedicated IT resources.

📚 Also read: Why don't existing IT tools help with visibility?

Pros

  • Can be tailored to fit the unique requirements and workflows of the organization
  • Better equipped to handle large volumes of diverse data
  • Allows for more complex data processing and advanced analytics

Cons

  • Involves significant investment in time, money, and skilled personnel
  • Requires a high level of technical expertise to develop and manage
  • Development and implementation can take considerable time

Spreadsheet or custom solution: Choosing the right approach

Not every IT team needs (or can support) a full-scale data lake. The right approach depends on your environment, resources, and goals—whether that means starting simple with spreadsheets, investing in a custom build, or finding a smarter middle ground.

For each dimension, here’s what a “Yes” looks like:

  • SaaS complexity: You have multiple SaaS applications, and data across them is currently siloed or hard to integrate
  • Data volume and variety: Your organization generates high volumes of structured, semi-structured, or unstructured data that requires centralized management
  • Analytics needs: You need advanced insights—cross-app reporting, predictive analytics, or real-time dashboards
  • IT talent and resources: Your team has developers, data engineers, or IT staff capable of building and maintaining a custom solution
  • Security and compliance: You must enforce strict access controls, audit trails, and regulatory compliance across data sources
  • Scalability requirements: Your data storage and processing needs are expected to grow rapidly
  • Automation and efficiency: You want to reduce manual work, streamline workflows, and improve operational efficiency
  • Executive support: Leadership understands the investment and expected ROI, and supports long-term adoption
  • Budget and cost flexibility: You have the financial resources to support a custom solution, including ongoing maintenance
Answer “Yes” only if your team can confidently meet the requirement today or in the very near term. Answer “No” if significant gaps exist that would prevent you from achieving the outcome without additional resources, tools, or support.

Interpreting your answers

  • If you answer “No” on two or more dimensions, a spreadsheet-based solution may be a better starting point while you build internal capability.
  • If you answer “Yes” on most dimensions, you are in a strong position to custom-build a data lake that fits your organization’s specific needs.
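One way to encode that rule of thumb in code (the thresholds are illustrative, and the middle case maps to a SaaS management platform):

```python
def recommend(answers: dict) -> str:
    """Turn per-dimension True/False answers into a starting point:
    two or more 'No' answers point to a spreadsheet start, 'Yes'
    across the board supports a custom build, anything in between
    suggests a SaaS management platform."""
    noes = sum(1 for v in answers.values() if not v)
    if noes >= 2:
        return "spreadsheet"
    if noes == 0:
        return "in-house build"
    return "SaaS management platform"

dimensions = ["SaaS complexity", "Data volume", "Analytics needs",
              "IT talent", "Security", "Scalability",
              "Automation", "Executive support", "Budget"]
answers = dict.fromkeys(dimensions, True)
answers["IT talent"] = False
answers["Budget"] = False
print(recommend(answers))  # spreadsheet
```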

If your answers fall in the middle, a SaaS management platform can be a smart choice. Tools like Stitchflow give IT teams a ready-made, centralized view of all your SaaS apps—tracking usage, monitoring access, and providing actionable insights—without the time and cost of building and maintaining your own data lake.

📚 Also read: How to take a Data-First Approach to Corporate IT Tool Sprawl

Why modern IT teams use Stitchflow as their SaaS data lake

For spreadsheets, you’re stitching together imports and formulas. For an in-house build, you’re investing in pipelines, storage, and engineers. 

Stitchflow takes a different path: it connects directly to your apps and identity systems, pulls in data through APIs, CSVs, and other methods, and continuously reconciles everything into one unified graph. No setup scripts, no schema design, no infrastructure to maintain—just a system that’s ready to run from day one. 

Stitchflow's data lake architecture

From there, you get the same advantages you’d expect from a data lake, but without the overhead:

  • Unified ingestion: Pulls data from every type of app, including API-based, non-SSO, CSV-only, and even newer AI tools with no integration options
  • Queryable source of truth: The IT Graph reconciles users, roles, and usage across apps and domains so you can ask clear questions and get reliable answers
  • Always current: Data refreshes continuously, so you’re not working off stale exports or scrambling before an audit
  • Business context: Access is tied back to teams, departments, and roles, so the data actually reflects how your company operates
  • From insight to action: You can fix issues right away with one-click cleanup, automated tickets, or quick license reviews in Slack
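A toy version of that cross-app reconciliation, merging per-app user exports into one record per email (the real IT Graph also handles roles, domains, and aliases, which this sketch ignores):

```python
from collections import defaultdict

def reconcile(app_exports: dict) -> dict:
    """Merge per-app user lists into one record per (normalized)
    email, so 'which apps does this person touch?' has one answer."""
    graph = defaultdict(lambda: {"apps": []})
    for app, users in app_exports.items():
        for user in users:
            email = user["email"].strip().lower()
            graph[email]["apps"].append(app)
    return dict(graph)

exports = {
    "okta": [{"email": "ana@example.com"}],
    "slack": [{"email": "Ana@example.com"}, {"email": "bo@example.com"}],
}
merged = reconcile(exports)
print(merged["ana@example.com"]["apps"])  # ['okta', 'slack']
```

Even this trivial version surfaces a useful signal: `bo@example.com` appears in Slack but not the identity provider, exactly the kind of gap an access review looks for.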

Think of Stitchflow as the data lake IT teams actually need: not just storage, but a living system that helps you see everything, trust the data, and act on it.

Book a demo to see how Stitchflow delivers the visibility today’s IT environments demand—holistic, actionable, and effortless to maintain. Or get a done-for-you report on user access review, SaaS license usage, or even shadow IT—your first report is free.

Frequently asked questions

What’s the difference between a data lake and a data warehouse?

A data warehouse is structured and optimized for reporting—you define the schema upfront, and it works well for consistent, repeatable queries. A data lake is more flexible: it can hold raw, semi-structured, and unstructured data, which makes it better for IT teams dealing with logs, access data, and diverse SaaS outputs that don’t fit neatly into a rigid schema.

How do you keep a data lake secure?

Start by limiting who has access—role-based controls are key. Make sure data lineage is clear so you can always trace information back to its source, and log all activity for auditability. Encrypt sensitive data at rest and in transit, and align retention policies with compliance requirements like GDPR or HIPAA.

What are the most common mistakes teams make with data lakes?

The biggest one is treating it like a dumping ground—loading data without a plan for how it’ll be used. Teams also underestimate the ongoing work: keeping pipelines current, cleaning data, and managing schema drift. Finally, skipping governance early on almost always leads to messy, unreliable data later.

Sanjeev NC started his career in IT service desk and moved to ITSM process consulting, where he has led award-winning ITSM tool implementations. Sanjeev was also a highly commended finalist for Young ITSM Professional of the Year in itSMF UK’s annual awards.