Make sure you understand that Data Scientist and Data Engineer are not the same thing.
A Data Scientist builds models using mathematics, statistics, and machine learning to explain and predict complex behavior, and codifies those models into real-world software. A Data Engineer designs and builds data architectures for ingesting, processing, and surfacing data in large-scale, data-intensive applications.
Often the Data Scientist and Data Engineer work together to build an end-to-end solution for companies that need advanced analytical models operationalized at scale. The Data Scientist is interested in large-scale architecture only insofar as it allows the "science to scale." Thus any Big Data project should have a Data Scientist alongside the Data Engineer to ensure that what gets built is analytically sound: there is no point in engineering a big data architecture that doesn't prepare and process data in a way that supports the specific models built by the Scientist.
As with the Data Scientist, there is no formal path to becoming a Data Engineer, since the role is a unique blend of skills that have been brought together to form a distinct and much-needed discipline. The requirements for a Data Scientist are typically more "academic," as they are expected to understand and conduct scientific research and to know how to build and test advanced models. PhDs are often sought for Data Science, with backgrounds in the hard sciences or computer science. Data Engineers typically come from an engineering background, with less emphasis on academic credentials, although many still have master's degrees. Developers interested in designing large-scale architectures for data-intensive applications can often move into this field, as there is much less emphasis on science and math and more on engineering and development.
Since there is less emphasis on an advanced academic background, it can be easier for someone to move into Data Engineering than into Data Science.
Data Engineers should understand the core concepts in computer science and should be very well versed in designing and building large-scale applications end to end. They should understand the pros and cons of relational and NoSQL databases. They must know how to design effective pipelines for both batch and streaming use cases. They must know what it takes to operationalize a working model and how to help push some of the "lab" specifics (training and validation) into real-time engines. They must understand distributed computing and should be able to work with the Data Scientist to split algorithms across machines in a way that still yields predictive accuracy across a variety of domains. They should know when to push schemas towards the application to allow for "data lake" designs that assist in large-scale analysis but still serve domain-specific applications. And they should be very familiar with the core technologies used to build these systems.
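To make the batch versus streaming distinction concrete, here is a minimal sketch in plain Python. It assumes a made-up event format (the `user` and `amount` fields, and the function names, are hypothetical and not taken from any particular framework). The point is only that the same cleaning step can feed both a batch job, which sees a bounded dataset and reports once, and a streaming job, which reports after every event.

```python
from typing import Iterable, Iterator


def clean(record: dict) -> dict:
    """Shared transformation: normalize one raw event (hypothetical fields)."""
    return {
        "user": record["user"].strip().lower(),
        "amount": float(record["amount"]),
    }


def run_batch(records: Iterable[dict]) -> dict:
    """Batch mode: consume the whole bounded dataset, then return totals once."""
    totals: dict = {}
    for rec in map(clean, records):
        totals[rec["user"]] = totals.get(rec["user"], 0.0) + rec["amount"]
    return totals


def run_streaming(records: Iterable[dict]) -> Iterator[tuple]:
    """Streaming mode: yield an updated running total after every event."""
    totals: dict = {}
    for rec in map(clean, records):
        totals[rec["user"]] = totals.get(rec["user"], 0.0) + rec["amount"]
        yield rec["user"], totals[rec["user"]]


# Illustrative input only; a real system would read from files or a queue.
events = [
    {"user": " Ann ", "amount": "10.5"},
    {"user": "bob", "amount": "3"},
    {"user": "ann", "amount": "2"},
]

batch_totals = run_batch(events)
stream_updates = list(run_streaming(events))
```

Keeping `clean` separate from the two drivers is one possible design choice: the logic validated in the "lab" stays identical whether it runs over yesterday's files or over live events.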
The only way to become great at what you do is to constantly and relentlessly jump into problems and solve them using the techniques and approaches of your chosen field. There are many books and tutorials on the technology used by the Data Engineer. Choose a problem and a public dataset and build a system. Build many systems. Fail again and again, and learn what it takes to bring an end-to-end solution to a real-world problem. Anyone can take a course or read a book. But if you have real projects that you have built and worked on, you can point to them and discuss in depth the challenges you faced and how you solved them.
This is the age of free information. There is nothing you can't learn by researching and practicing. Jump into as many problems as you can and build working solutions you can show people. When you have real-world applications that you have built, you don't have to convince people of what you know. You simply show them.
Good luck.