02 October 2017
There are more data scientists than you think
Data scientists and machine learning (ML) engineers are increasingly in demand, and companies of all sizes are paying top dollar for experienced candidates. The supply of candidates with direct experience or training in these roles is low, but high-quality data science talent is available to companies that take a broader view of the profile.
Academics in many disciplines are increasingly taking advantage of advances in data collection and analytics in their research. Students in these disciplines are digitally native and have experience using data analytics tools from their study and research. By thinking of a data scientist as a collection of skills instead of a prepackaged role, employers can dramatically expand the potential candidate pool.
Many of the skills required of data scientists are based on fundamental statistics, including model estimation and inference. Data analysis employs statistical techniques ranging from basic plots (histograms, line graphs, heat maps) to linear regression and correlation, factor analysis, and principal component analysis.
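To make two of those techniques concrete, here is a minimal sketch of correlation and simple linear regression in plain Python. The data set (hours studied versus test scores) and the function names are made up for illustration; in practice a candidate would more likely reach for a statistics library.

```python
from math import sqrt


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = (
        sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        / sum((x - mx) ** 2 for x in xs)
    )
    intercept = my - slope * mx
    return slope, intercept


# Hypothetical sample: hours studied and the resulting test score.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 55.0, 61.0, 64.0, 68.0]

r = pearson_correlation(hours, scores)
slope, intercept = fit_line(hours, scores)
print(f"correlation: {r:.3f}")
print(f"fitted line: score = {slope:.2f} * hours + {intercept:.2f}")
```

The same two calculations sit underneath much of the coursework in the fields listed below, which is part of why those skills transfer.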
Statistics-centric subfields are the norm across science and social science disciplines, not just math and computer science. Some of those subfields are obvious from their names, like biostatistics, econometrics, or statistical mechanics. Other, less obviously data-science-centric fields include:
- Demography is the statistical study of all populations. It can be a very general science that can be applied to any kind of dynamic population, that is, one that changes over time or space.
- Epidemiology is the study of factors affecting the health and illness of populations, and serves as the foundation and logic of interventions made in the interest of public health and preventive medicine.
- Geostatistics is a branch of statistics that deals with the analysis of spatial data from disciplines such as petroleum geology, hydrogeology, hydrology, meteorology, oceanography, geochemistry, and geography.
- Operations research (or operational research) is an interdisciplinary branch of applied mathematics and formal science that uses methods such as mathematical modeling, statistics, and algorithms to arrive at optimal or near optimal solutions to complex problems.
- Population ecology is a sub-field of ecology that deals with the dynamics of species populations and how these populations interact with the environment.
- Quality control reviews the factors involved in manufacturing and production; it can make use of statistical sampling of product items to aid decisions in process control or in accepting deliveries.
- Quantitative psychology is the science of statistically explaining and changing mental processes and behaviors in humans.
Wikipedia offers a more comprehensive list of quantitative subfields.
Data science projects in these disciplines might include taxonomy creation through text mining, clustering applied to big data sets, simulations, rule systems for statistical scoring engines, root cause analysis for meteor detection, predicting the emergence of pandemics, or tracking the growth of radicalized groups across a region in conflict. All of these are experiences that would be highly relevant in an AI-based technology startup!
There is no one-size-fits-all data scientist, nor is there a single data scientist archetype. Here’s a partial overview of the different types of data scientists and how candidates from the backgrounds above could bring useful skills to each type.
Types of data scientists
- Data scientists strong in statistics are experts in statistical modeling, experimental design, sampling, clustering, data reduction, confidence intervals, testing, predictive modeling, and other related techniques. These candidates can come from fields like biostatistics, statistical signal processing, econometrics, or actuarial science.
- Data scientists strong in mathematics can perform analytic business optimization (inventory management and forecasting, pricing optimization, supply chain, quality control, yield optimization) by collecting, analyzing, and extracting value from data. Researchers from astrostatistics, operations research, physics, and chemometrics would be well suited to solving such problems.
- Data scientists strong in business handle tasks traditionally performed by business analysts in bigger companies, such as dashboard design, metric mix selection and metric definition, ROI optimization, or high-level database design. These candidates might have studied reliability engineering, epidemiology, quality control, or operations research.
- Data scientists strong in visualization. These candidates might come from quantitative psychology, demography, population ecology, chemometrics, or geostatistics.
- Data scientists strong in GIS or spatial data. These candidates might come from geostatistics, demography, environmental statistics, or epidemiology.
- Data scientists strong in data engineering, including Hadoop, database/memory/file systems optimization and architecture, APIs, Analytics as a Service, optimization of data flows, and data plumbing. These candidates might have studied biostatistics or astrostatistics.
As more businesses recognize the value of data, they will begin to integrate data science more closely into their decision making and day-to-day operations. When this happens, teams that take a statistics-first approach to hiring data scientists may increase productivity by pairing their data science team with a software engineer who can help them write efficient, production-quality code that scales with the growth of the organization.
Requiring candidates to be skilled on both the analysis side and the engineering side of data science sharply limits the pool of available candidates. By decoupling the data science aspects of the work from the software engineering requirements, hiring managers can greatly increase their pool of talented candidates.