September 10, 2024

Making Data Accessible

Author:
Valeria Andreolli

Trust is an essential component when dealing with the acceptance of technology.

Since the mechanisms behind technology and artificial intelligence are not always clear to users, being transparent about the data and the processes used is fundamental to building and maintaining users’ trust.

With AI, transparency means offering clear explanations about how artificial intelligence systems operate, why they produce specific results, and what data they use. Only in this way is it possible to build users’ trust.

As part of our Social Impact Tech series, we interviewed Peter Yeuan-Chen Huang, Data Engineer at Data Friendly Space (DFS). Specializing in public cloud environments, he is passionate about gathering and finding valuable information hidden within complex types of data, which can ultimately improve decision-making and insights.

We discussed with him the fundamental principles to follow before beginning any data collection process, the best tools to derive meaningful insights from complex datasets, and the steps to make technology transparent and explainable to users and stakeholders.

What inspired you to start working in technology?

Although my major was not computer science or information technology, I was still involved in technology, specifically medical technology. The mindset in medical technology is similar and closely related to information technology, especially data engineering. Good data processing and data interpretation are essential for producing a meaningful and reasonable report of any medical experiment.

My journey into data engineering began with an advanced medical technology project, which sparked my interest in how data and technology can revolutionize healthcare. I transitioned from medical tech to IT and data engineering because I saw the potential of technology to solve complex problems in healthcare, realized the broader applications of data across various industries, and wanted to bridge the gap between cutting-edge technology and practical, real-world solutions.

What are your favorite tools/technologies?

As a data engineer, I have extensive experience working with a variety of tools and technologies. My top picks for each scenario are:

  1. For data processing, there is no debate: Python and SQL are the way to go! These are my everyday essentials. I cannot do my work without them. They are simple enough that it is like speaking a second language to the computer.
  2. When it comes to data storage, I use an S3 bucket for simple storage because it is what you would expect from its name, simple. SQLite is the perfect choice for a lightweight database, and that’s exactly what you get, Lite! For more advanced needs, I would recommend PostgreSQL, MongoDB, or BigQuery. As the world becomes more digital, data volumes are growing larger and more complex. Each storage solution has its pros and cons. It is crucial to use the right tool for the right purpose rather than relying on a single solution or trying to force data into something it cannot be.
  3. In terms of visualizing data, if you do not have a dedicated engineering team to build custom charts for you, Tableau, Looker Studio, and Power BI are all excellent tools. Apache Superset deserves an honorable mention as an open-source tool: it is powerful and simple enough for real-world projects.
  4. Data Infrastructure: The public cloud platform is a total game-changer. AWS, Google Cloud, or Azure are all viable options for scaling data processing and provisioning capabilities. They can handle an enormous amount of data in a short amount of time without too much preparatory work.
  5. Last but not least, LLMs undoubtedly improve workflow. Chatbots like Claude or ChatGPT can answer general questions, and the same models integrated into Cursor can provide code assistance, both of which increase workflow efficiency.

These tools are invaluable and the professionals who created and maintain them deserve heaps of recognition for these game-changing applications. They allow us to derive meaningful insights from complex datasets and real-world challenges.

What are some of the biggest challenges you face when integrating technology into sectors that are still catching up?

Change can be challenging for many organizations and industries, so it is understandable that they are hesitant to adopt new technologies. New things can be scary and difficult to understand.

From my previous experience as a cloud solutions architect and as a data engineer working with various industries, I’ve learned that some key challenges include:

  1. Legacy systems: Integrating modern technology solutions with outdated infrastructure or management can be tricky and often requires a lot of resources to migrate. In some cases, it’s simply not possible to integrate with the legacy system, and refactoring it requires even more resources.
  2. Skill gap: A lack of skilled employees within the organization becomes a problem for later maintenance, since the organization cannot outsource integration forever.
  3. Regulatory compliance: New technology must adhere to strict regulations, especially in sectors like healthcare, government, and finance.

What should everyone have in mind when handling data collection and data quality?

Before beginning any data collection process, it is important to follow a set of fundamental principles. While the exact phrasing can shift depending on organizational needs, the principles themselves should stay the same or very similar.

  1. Why are we collecting this data? We may always be tempted to collect more data, either as a backup or because we think it could be useful in the future, but we need to take a step back and ask ourselves: do we really need it? Storing and processing data is expensive in terms of both computational resources and time. Without a purpose, there is no point in wasting valuable resources on it.
  2. Are we allowed to collect this data? Being able to access a piece of data does not mean that you have the right to store it or use it later. Organizations need to ensure that they are complying with data protection and privacy laws, which can get tricky when tackling global issues.
  3. What information does this data provide? Determine whether it is publicly available or contains sensitive information. We must take responsibility for the data we collect. It is crucial that we have the capacity to properly handle and secure this data and not allow it to compromise individuals’ privacy.
  4. How will we be using this data? We must document every decision regarding how this data is, or will be, processed. This will maintain clear metadata and data lineage information, which is critical for ensuring data quality.
  5. Do we understand what the data contains? It is crucial that we fully comprehend the data we are working with. To guarantee data quality, we must understand the data and set up the appropriate checkpoints and validation processes. Poor-quality data, such as biased or inaccurately collected data, can have severe ethical implications and lead to undesirable impacts.
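
As a rough illustration of principles 4 and 5, the sketch below records basic lineage metadata for a dataset and runs a simple validation checkpoint over it. It is a minimal example under stated assumptions, not a description of any DFS pipeline: the `LineageRecord` class, the `validate_rows` function, the field names, and the sample rows are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LineageRecord:
    """Hypothetical record of why and how a dataset was collected."""
    source: str
    purpose: str
    contains_sensitive_data: bool
    processing_steps: list[str] = field(default_factory=list)
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def validate_rows(rows, required_fields):
    """Split rows into (valid, rejected) based on required, non-empty fields."""
    valid, rejected = [], []
    for row in rows:
        missing = [f for f in required_fields if row.get(f) in (None, "")]
        if missing:
            rejected.append({"row": row, "missing": missing})
        else:
            valid.append(row)
    return valid, rejected


# Made-up sample rows, for illustration only.
rows = [
    {"country": "KE", "date": "2024-05-01", "value": 12.5},
    {"country": "", "date": "2024-05-02", "value": 7.0},
    {"country": "UG", "date": "2024-05-03", "value": None},
]

lineage = LineageRecord(
    source="example_survey.csv",  # hypothetical file name
    purpose="illustrative monthly report",
    contains_sensitive_data=False,
)
valid, rejected = validate_rows(rows, required_fields=["country", "date", "value"])

# Document the decision so the processing history travels with the data.
lineage.processing_steps.append(
    f"validated required fields: kept {len(valid)}, rejected {len(rejected)}"
)
print(lineage.processing_steps[-1])
```

Keeping rejected rows together with the reason for rejection, rather than silently dropping them, makes it easier to explain later why a figure in a report differs from the raw source.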

What steps are you taking to make technology transparent and explainable to users and stakeholders?

In my view, the aim of transparency and explanation is not to offer the most open or detailed explanation possible. Instead, it is to address the issue of new technology being distrusted due to a lack of background context and knowledge. By providing comprehensive documentation, using self-explanatory terms, and communicating in clear, non-technical language with examples and visuals, we can significantly reduce the fear of the unknown.

Let’s take GANNET, our humanitarian AI tool, for example. It was designed with very easy-to-understand terms. We are transparent about our approach and have even directly embedded a data explorer so users can see exactly what is being used when the LLM generates responses to their questions. This eliminates any guesswork about where the knowledge comes from. Our team hosts demo workshops to help users and stakeholders understand how this platform was built and how to get the best out of it. We have also provided real-world usage examples from our internal analysts to show how to use the platform to its fullest potential.

Another example comes from a different project, where we needed to incorporate multiple data sources whose data looks similar but differs substantially in detail, which made the solution quite complex. To ensure that users and stakeholders are on the same page as we are, we document all the details together and make it very straightforward to see the differences in the data. We provide visualizations whenever possible, along with examples of what we are working on, which narrows the information and knowledge gap between us.

When it comes to AI/ML, especially LLM-related technology, terminology can differ between platforms or vendors because the field is relatively new and iterates quickly. This can cause misalignment in communication between end-users and developers. Pre-defining terms is essential to align everyone.

How do you envision the role of AI in tackling global challenges such as climate change?

Regarding “global challenges” from the perspective of a data engineer, the data related to these challenges often share common properties: massive volumes and differing standards between countries. This can include variations in units, language, and even data formats within the same sector. Because AI is advancing so rapidly, it is difficult to predict whether it will be limited to a specific role in addressing global challenges. However, AI has proven to outperform general tools or even humans in specific use cases, suggesting its potential to enhance current solutions for these challenges.
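
To make the point about differing units concrete, here is a minimal sketch of normalizing measurements reported in different units into one common unit before analysis. The rainfall scenario, the record layout, and the `normalize_rainfall` function are assumptions made for illustration; only the conversion factors (10 mm per cm, 25.4 mm per inch) are fixed by definition.

```python
# Conversion factors to a common unit (millimetres); exact by definition.
TO_MILLIMETRES = {"mm": 1.0, "cm": 10.0, "in": 25.4}


def normalize_rainfall(record):
    """Return a new record with rainfall expressed in millimetres."""
    unit = record["unit"].strip().lower()
    if unit not in TO_MILLIMETRES:
        # Fail loudly rather than guess at an unrecognized unit.
        raise ValueError(f"unknown unit: {record['unit']!r}")
    return {
        "country": record["country"],
        "rainfall_mm": round(record["value"] * TO_MILLIMETRES[unit], 2),
    }


# Hypothetical reports from three sources, each using a different unit.
reports = [
    {"country": "A", "value": 2.0, "unit": "in"},
    {"country": "B", "value": 5.08, "unit": "CM"},
    {"country": "C", "value": 50.8, "unit": "mm"},
]

normalized = [normalize_rainfall(r) for r in reports]
for row in normalized:
    print(row)
```

Raising an error on an unknown unit is a deliberate choice in this sketch: silently passing through an unconverted value is exactly the kind of quality problem that is hard to spot once sources are merged.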

AI, as a tool, is reaching a point where it can easily bring ideas into reality or convert information across different formats and standards with minimal effort. This lowers the technical barriers to starting new projects and allows for more trial-and-error iterations in developing possible solutions to these challenges.

AI also excels in identifying relationships between information that may not be easily classified by humans, such as complex interactions related to climate change. Furthermore, AI’s assistance in screening possibilities or providing diverse domain knowledge can streamline the process of understanding these complex relationships that might span across multiple sectors.

Looking ahead, implementing AI into edge sensor devices could lead to extracting more useful information from smaller datasets. This would benefit meta-analysis by reducing the cost of processing massive raw data and providing more summarized details within each intermediate process.

It’s important to note, however, that AI is just one tool among many necessary to address systemic global issues such as climate change or natural crises. Targeted interventions require a combination of technological innovation, policy changes, and societal shifts. Lastly, when relying on AI to provide solutions or answers to open questions, it’s crucial to prioritize the quality of the data itself over the AI tools. As the saying goes, “garbage in, garbage out.” Misusing poor-quality data will not benefit society and may even cause harm. Still, as technology and AI are adopted more widely, there is certainly a lot to be said for their potential to help address these challenges.
