Data engineers are often mistaken for the better-known data scientists. Yet they are in particularly high demand on the tech job market today, in the wake of new Big Data-related technologies. So what do data engineers do, exactly? Let’s talk with Jérémy Caldas, data engineer at fifty-five, to find out.
Hey Jérémy! Can you tell us what exactly data engineers do, and especially about your role at fifty-five?
As a data engineer at 55, my job is to structure data feeds (or pipelines) and make them available to users under the best possible conditions. A pipeline is a technical solution for processing data, made up of several steps. We provide pipelines to our clients and internal users to meet their needs.
First, we collect data at the source, either through external services or directly from the client. This could mean anything from developing collectors to using APIs or other services. Ideally, the data collection technique should be reusable for similar future operations.
Next, we make this data readily available to data scientists and data analysts under the best possible conditions: with reasonable access times and intuitive means of access, while remaining as cost-effective as possible. At 55, we use large-scale data analysis tools such as Google BigQuery or Amazon Athena.
Lastly, we deliver the “refined” data (meaning aggregated and structured) to the client, who can then visualise it using dashboards, for instance. This “refined” data can be obtained in many ways, from common aggregations to complex machine learning algorithms.
At the same time, we continuously build and structure tools for internal teams. The processes above must be made available to others as reusable technological building blocks, so that we can leverage what has already been done when building future projects.
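As a rough illustration of the three steps described above (collect, make available, deliver refined data), here is a minimal Python sketch. It is not fifty-five's actual code: the function names, the sample rows, and the `events` table are all hypothetical, and an in-memory SQLite database stands in for a warehouse such as BigQuery or Athena.

```python
import sqlite3
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical record shape (campaign name, click count); not from the interview.
Row = Tuple[str, int]


def collect() -> Iterable[Row]:
    """Step 1: collect raw data at the source.

    A real collector would call an external API or read client files;
    this one returns hard-coded sample rows.
    """
    return [("spring", 120), ("spring", 80), ("summer", 200)]


def load(conn: sqlite3.Connection, rows: Iterable[Row]) -> None:
    """Step 2: make the data queryable for analysts.

    SQLite is only a stand-in here for a large-scale analysis tool.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (campaign TEXT, clicks INTEGER)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    conn.commit()


def refine(conn: sqlite3.Connection) -> Dict[str, int]:
    """Step 3: produce "refined" (aggregated, structured) data."""
    cursor = conn.execute(
        "SELECT campaign, SUM(clicks) FROM events "
        "GROUP BY campaign ORDER BY campaign"
    )
    return dict(cursor.fetchall())


def run_pipeline(collector: Callable[[], Iterable[Row]]) -> Dict[str, int]:
    """Chain the steps; the collector is a parameter so it can be swapped."""
    conn = sqlite3.connect(":memory:")
    try:
        load(conn, collector())
        return refine(conn)
    finally:
        conn.close()


if __name__ == "__main__":
    print(run_pipeline(collect))
```

Passing the collector in as an argument is one simple way to get the reusability mentioned above: a new data source only needs a new collector function, while the loading and aggregation steps stay unchanged.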
How do you work with data analysts and data scientists on a daily basis? What are the differences and similarities?
Data scientists and data analysts are data engineers’ best friends! We help build the bridge between software engineering and data science, that is, between programming and data architecture on one side and statistical expertise on the other. Ultimately, we are responsible for the technological foundation upon which all other functions in the data world are built.
Data engineers work hand in hand with data analysts to cover both the business and technical sides of projects. Data analysts know the specifics of the business well, and thus understand client needs. With them, we ensure that everything done within the data pipeline respects best practices: security, resilience to data delivery errors, maintainability, and cost efficiency (platform/cloud).
Data engineers must also be able to challenge the decisions of data scientists, particularly regarding the algorithmic models and computing environments they use to produce results and analyses. This collaboration between data scientists and data engineers is essential to ensuring that models are valid and, once again, cost-effective.
How would you describe the key skills and qualifications needed to be a data engineer? Is there a preferred educational background?
I studied at EISTI, a French engineering school, and at Cranfield University in the UK. My studies combined software engineering, mathematics, and machine learning – which is a great combination for my job. At 55, other members of my team have software engineering skills (with Python and Java, etc.) and are interested in data science. But in this field, education never ends. There is always more to learn, in terms of key concepts as well as emerging technologies!
To become a data engineer, you have to be curious and versatile. You have to see the big picture of what goes on within your company. You must understand which technologies are being used, as well as the role of every person who has access to data. You have to constantly ask yourself if data is being processed in the right way – that is to say, you have to anticipate that a new usage might develop, which could require moving to a new scale. You’ve also got to look out for problems that might arise, so that you can have solutions at the ready.
What advice would you give to other data engineers?
Three concepts come to mind: DevOps, open source, and the cloud. I think that data engineers must understand and practice the DevOps philosophy. We often have to deal with roadblocks, but other companies (particularly Google, Airbnb, Spotify…) have often already encountered the same problems and shared their solutions as open source for all to access. In turn, we can improve these solutions where necessary and possible. Managed cloud tools (Google Cloud Dataproc and Amazon EMR, for example) are also really useful, as they allow us to put our storage and processing solutions (like Hadoop and Spark) in place quickly and easily.
We are also responsible for knowing and choosing the right technologies to facilitate internal procedures. These choices are important because they have structural consequences and cannot be revisited on a daily basis. We focus on regular communication and discussion within the data engineering team in order to find the right solutions. We also share our skills, our knowledge, and our latest findings about technological developments internally.
Being a data engineer at fifty-five means constantly growing and learning more about all these concepts!
How do you think this job will evolve in the medium term?
Data engineers are pretty hot on today’s job market. According to IBM’s estimates, annual demand for data scientists, data developers, and data engineers should reach 700,000 in the year 2020 in the US. Here in France, demand is also skyrocketing. It’s a good time to be a data engineer — and the forecast is sunny!
The level of versatility necessary is actually pretty rare, and requires genuine curiosity – not just for the technical aspects of the job, but also for the business side of it and its field of application. The best data engineers often have software engineering backgrounds and learned about their field of application on the job.
I think that data engineers are more important today than ever before, as everything now creates data, and data is becoming central to businesses. Having a comprehensive and robust IT infrastructure that stores data, makes it easy to use, and automates processing is an investment that all companies should consider on the path to becoming fully data-driven!
Translated from French to English by Niamh Cloughley.