Data scientists train AI/ML models for computer vision and natural language processing (NLP) problems to maximize performance, aiming for better accuracy and precision. This drive leads to models with complex architectures and parameter configurations. However, executing inference on these trained models in production workloads is an entirely different scenario.
Device and user applications across industrial, commercial, enterprise, and consumer use cases trigger these inference queries. Inference is typically served on data from real-world scenarios, and the output is sent back to the client applications.
Production inference operates under constraints such as a low compute budget at the edge or device layer and stringent latency and throughput requirements. Different applications have different sensitivities to these constraints.
- Real-time applications that require lightning-fast query responses prioritize latency over other parameters. Environment perception in navigation and control applications often requires real-time inference. In an Artificial Intelligence of Things (AIoT) context, these are generally found in autonomous machines used in mission-critical use cases.
- Analytical insight applications involving complex data analysis prioritize throughput in parallel query execution over other parameters. In AIoT, analyzing device user activity data and enterprise transactional data through recommender systems typically has such inference requirements.
Let us understand these key performance parameters:
Latency is the end-to-end time spent handling an inference query, i.e. the request time. It comprises queue time and compute time: compute time is the time spent actually executing the request in the requisite technology framework, and queue time is the wait before execution is initiated.
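The decomposition above can be sketched in a few lines. This is an illustrative simulation, not a real serving stack: the queue delay is randomized and the model execution is a placeholder `sleep`.

```python
import time
import random

def handle_query(compute_fn):
    """Measure end-to-end latency, split into queue time and compute time.

    compute_fn stands in for the actual model execution; the queue delay
    is simulated here purely for illustration.
    """
    arrival = time.perf_counter()
    time.sleep(random.uniform(0.001, 0.005))   # simulated wait in the request queue
    start_compute = time.perf_counter()
    compute_fn()                               # actual request execution
    end = time.perf_counter()
    queue_time = start_compute - arrival
    compute_time = end - start_compute
    latency = end - arrival                    # latency = queue time + compute time
    return queue_time, compute_time, latency

queue_t, compute_t, latency = handle_query(lambda: time.sleep(0.01))
print(f"queue={queue_t*1000:.1f} ms, compute={compute_t*1000:.1f} ms, total={latency*1000:.1f} ms")
```

In a real deployment the queue time comes from the serving platform's request scheduler, and shrinks or grows with load.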
Throughput is the number of inference queries executed per unit of time when queries are run in parallel and in asynchronous mode. It is measured by executing multiple batches of an optimal size, recording the time taken, and dividing the number of queries served by the elapsed time.
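A minimal sketch of that measurement, assuming a stand-in `run_batch` function whose per-item cost shrinks as the batch grows (the batch size of 32 is an assumed value one would find by profiling):

```python
import time

def run_batch(batch):
    # Stand-in for batched model execution: fixed overhead per call
    # plus a small per-item cost, so larger batches amortize better.
    time.sleep(0.001 + 0.0002 * len(batch))

queries = list(range(256))
batch_size = 32  # assumed optimal size, found by profiling

start = time.perf_counter()
for i in range(0, len(queries), batch_size):
    run_batch(queries[i:i + batch_size])
elapsed = time.perf_counter() - start

throughput = len(queries) / elapsed  # queries per second
print(f"{throughput:.0f} queries/sec over {elapsed:.3f} s")
```

Sweeping `batch_size` in such a harness is how the "optimal" batch size is found in practice: larger batches raise throughput until latency or memory limits are hit.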
What are the constraints in executing ML inference?
- Cost: The total cost of inference is a major factor in effective AI/ML operationalization. Production landscapes use diverse compute infrastructure for AI and ML, including GPUs and CPUs in data center or cloud locales. The workloads must utilize this hardware optimally to keep the cost per inference in check. One way to do this is to run queries concurrently, i.e. in batches.
- Latency budget: The budget varies by the ML use case. Mission-critical applications often need real-time inference, which minimizes the available latency budget; autonomous navigation, critical material handling, and medical devices often need such low latencies. On the other hand, some use cases with complex data analysis have a relatively high latency budget. Enterprise and consumer decisions need recommendation insights based on contextual (enterprise-internal and external) data. These insights can be computed in optimally sized batches, at a frequency matching the inference queries, and kept ready for decision-makers to access on demand.
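The cost-versus-batching trade-off above reduces to simple arithmetic. The instance price and throughput figures below are illustrative assumptions, not real benchmarks or cloud prices:

```python
# Back-of-envelope cost per inference. All numbers are illustrative.
instance_cost_per_hour = 3.00   # assumed hourly price of a GPU instance (USD)
throughput_qps = 500            # assumed queries served per second, one at a time

queries_per_hour = throughput_qps * 3600
cost_per_inference = instance_cost_per_hour / queries_per_hour
print(f"unbatched: ${cost_per_inference:.8f} per inference")

# Batching typically raises throughput (at some latency cost),
# driving the cost per inference down on the same hardware.
batched_qps = 2000              # assumed throughput with concurrent batched queries
batched_cost = instance_cost_per_hour / (batched_qps * 3600)
print(f"batched:   ${batched_cost:.8f} per inference")
```

The same arithmetic explains why high-latency-budget analytical workloads favor large batches: the hardware cost is fixed per hour, so every extra query per second directly lowers the unit cost.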
Another key factor is diversity in the enterprise technology landscape. Data scientists from different teams work on diverse problems, and different ML development frameworks, such as TensorFlow, PyTorch, and Keras, are best suited to each. Once these models are trained and released into production, the various model configurations need to work well together at scale. Production environments are equally diverse, with multiple inference locales in scope: device, edge, and data center/cloud. Containerization has become common practice in enterprise production landscapes, and Kubernetes-based deployments come in several forms, including the widely adopted managed Kubernetes services from AWS, Microsoft Azure, and Google Cloud, the leading public cloud vendors, alongside virtualized and bare-metal hardware deployments.
How are inference queries generated?
AI and ML inference requests are generated by client applications for making data-based decisions. The decisions range across navigation and control, supply chain optimization, product quality, user safety, medical diagnostics, and care delivery.
Once the user or system generates an inference request, it is typically communicated to and executed in the inferencing platform using APIs. These APIs carry the inference request input data, model configuration values, and parameter settings. They often load the models into the inferencing platform from the repository of operationalized models after retooling (conversion, optimization) for production. After inference, they return the model output (a prediction or classification score) to the client application for further use in the insight or control use case.
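To make this concrete, here is a sketch of what such an API exchange might carry. The field names, model name, and response shape are hypothetical, for illustration only; they do not correspond to any specific serving platform's API.

```python
import json

# Hypothetical inference request payload: input data, model selection,
# and parameter settings, as described above. All names are illustrative.
request = {
    "model_name": "defect_detector",        # a model from the operationalized repository
    "model_version": "3",
    "inputs": [
        {"name": "image", "shape": [1, 224, 224, 3], "data": "<base64-encoded pixels>"}
    ],
    "parameters": {"confidence_threshold": 0.5},
}
body = json.dumps(request)  # serialized and sent to the inferencing platform

# A response of this general shape would come back: a classification
# label and score for the client application to act on (illustrative).
response = json.loads('{"outputs": [{"name": "class", "data": ["scratch"], "score": 0.93}]}')
print(response["outputs"][0]["data"][0], response["outputs"][0]["score"])
```

Real serving platforms define their own request and response schemas, but the pattern is the same: inputs and parameters in, predictions or scores out.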
What tools are available for ML inference serving?
Various open-source, commercial, and integrated platform tools are available for inference serving. Open-source tools include TensorFlow Serving, TorchServe, NVIDIA Triton Inference Server, Cortex, and Multi Model Server. Others include KFServing (part of Kubeflow), ForestFlow, DeepDetect, and BentoML. Most of these support the leading AI/ML development frameworks (TensorFlow, PyTorch, Keras, and Caffe) and integrate seamlessly with leading DevOps and MLOps tool stacks.
eInfochips, an Arrow company, has strong expertise in ML solution development from concept to operationalization. We have helped customers across smart buildings, fleet management, medical devices, retail, and home automation industry segments to build and roll out smart, connected products at scale using computer vision and NLP technologies.