MLOps — an enterprise implementation through the CDO’s lens

Meenakshisundaram Thandavarayan
7 min read · Nov 8, 2023

Get busy living or get busy….

As generative AI becomes more prominent, enterprises are in a frenzy to apply Gen AI to every use case. Adoption of AI solutions within lines of business (LOBs) is rising, and so is the dependency on LLMs for decisions that affect core business processes. Gen AI brings enormous potential, but with great potential comes great risk. Model failures raise business risk across the board, from “License to Operate” to compliance, regulatory, financial, and reputational risk. It is imperative to look beyond financial metrics to govern these risks.

  • Model Opaqueness: This has multiplied manyfold with the rise of LLMs. A recent study found that the level of transparency is below 50% for most models, which is not very promising.
  • Model Performance: Prompt engineering, guardrails, and moderation services are required to limit model risks and hallucinations.
  • Instability of Frameworks: Many of the frameworks and plug-ins that support LLMs are nascent and still in the works. For example, LangChain, a popular orchestration framework, is fit for purpose for proofs of concept, but it changes rapidly and suffers from backward compatibility challenges.
  • Cost of Operations: Self-hosting models or consuming models as a service at production scale is expensive, and so is scaling them for many concurrent users. Token-based pricing, meanwhile, can spiral out of control if no thresholds are set. Proper cost planning for the overall solution is key to validating the business case for return on investment.
  • Diverse design patterns and multitude of tools: A Gen AI application is more than just models. If each LOB uses its own tools for knowledge bases, agents, memory, moderation, and guardrails, along with non-standard design patterns, the result will be unsustainable solutions. At the same time, LOBs still need choices and options, provided through proper vetting.
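
The point about token-based pricing and thresholds can be made concrete with a small sketch. This is a minimal illustration, not a production cost control: the `TokenBudget` class, the monthly limit, and the price per 1,000 tokens are all hypothetical values chosen for the example, and real pricing varies by model and provider.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    """Tracks token spend for one LOB against a monthly cost ceiling."""

    monthly_limit_usd: float        # hypothetical threshold agreed with the LOB
    price_per_1k_tokens_usd: float  # hypothetical rate; check your provider's pricing
    spent_usd: float = 0.0

    def cost_of(self, tokens: int) -> float:
        """Cost in USD of consuming the given number of tokens."""
        return tokens / 1000 * self.price_per_1k_tokens_usd

    def try_spend(self, tokens: int) -> bool:
        """Record the spend and return True, or return False if it would exceed the limit."""
        cost = self.cost_of(tokens)
        if self.spent_usd + cost > self.monthly_limit_usd:
            return False
        self.spent_usd += cost
        return True


budget = TokenBudget(monthly_limit_usd=10.0, price_per_1k_tokens_usd=0.5)
print(budget.try_spend(12_000))  # True: 6.00 USD is within the 10.00 USD limit
print(budget.try_spend(12_000))  # False: a second 6.00 USD would exceed the limit
```

A guard like this would sit in front of every model call, so that a runaway workload is rejected before the cost is incurred rather than discovered on the invoice.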

The role of Chief Data Officers (CDOs) is becoming more important in protecting the organization against these risks while modernizing at the pace of change.

CDOs need to strike the right balance between control and innovation in this hyper-dynamic environment.

Platform thinking: Consistency in the use of models, tools, frameworks, and reusable patterns while keeping up with the pace of change. The platform leverages existing investments in Data and AI and extends them to generative AI capabilities.

XOps: Operational excellence through automation and orchestration of Infra Ops, Data Ops, Model Ops, ML Ops, and DevOps, with visibility and transparency.

Upskilling: Grow AI knowledge and capability, and establish a community to continue growing our talent and sharing resources.

Process: Create the culture and framework that make responsible AI part of our digital DNA.

Activate AI:

Develops all the necessary foundations (platforms, tools, services, processes, structures, and guidance) for ML Ops that enable LOBs to engineer, develop, and deploy models into production. This includes a governance framework to ensure integrity, transparency, security, safety, and trustworthiness in our use of AI.

Accelerate AI:

Focuses on enabling LOBs to adopt AI optimally and efficiently. This includes, but is not limited to, customized onboarding plans, enablement, self-service capabilities, forums, and frameworks. It also includes building a ‘living’ AI community, providing universal access to the best of AI, and building a culture of collaboration and connection to advance human-machine collaboration as the new norm.

There’s a difference between knowing the path and walking the path…

Activate AI: This is achieved through the following pillars.

  • ML Ops Garage: A collection of approved tools for different personas within LOBs to deliver end-to-end data science solutions. The Garage serves both citizen and expert personas across the life cycle, from building data engineering workflows and models to testing, governance, and deployment.
  • ML Ops Governance: Controls, practices, documentation, and artifacts embedded across the ML life cycle to ensure all models are secure, trusted, traceable, responsible, and sustainable.
  • ML Ops Factory: Industrialized pipelines that deliver repeatable, continuous deployment of ML models to diverse deployment targets in a performant, cost-optimized manner.
  • ML Ops Services: A catalog of services provided to LOBs through internal or third-party sourcing to address gaps in capability and capacity within LOBs. This includes, but is not limited to, labelling services, data engineering, and consulting services.

Accelerate AI: Accelerate is focused on adoption.

  • Onboarding the LOB to the ML Ops Framework [ML Ops Springboard]: Enables the LOB to onboard onto the ML Ops capabilities. The CDO offers a customized onboarding path based on the maturity, capability, capacity, and needs of each LOB. Once onboarded, the LOB can use all of the above ML Ops capabilities through a self-service customer experience.
  • Develop using the ML Ops capabilities safely and securely [ML Ops Marketplace]: A one-stop shop for all development within the data science life cycle. Data engineers, data scientists, and ML engineers of all kinds can securely develop their transformations and models using approved data, frameworks, libraries, and accelerators. The Marketplace hosts a Data Marketplace [internal and externally subscribed data], a Model Marketplace [promoting reusability], and the ML Garage [to turn data into decisions].
  • Govern, monitor, and act on the models deployed in production [ML Ops Dashboard]: A one-stop shop for everything in production. Once a model is deployed into production, the ML Ops Dashboard gives a single view of model performance and cost. All models are monitored for performance, accuracy, bias, drift, and lineage.
  • ML Ops Academy: Customized learning and certification paths for LOBs to learn and get certified on the foundational capabilities of the platform, tools, governance, and factory.
  • ML Ops Community: As LOBs onboard and use the ML Ops capabilities, the CDO will facilitate a community of practice and a center of excellence for sharing best practices and repeatable patterns.
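
As an illustration of the drift monitoring mentioned for the ML Ops Dashboard, the sketch below computes the Population Stability Index (PSI), one common drift measure. The article does not say which metric the dashboard uses, so PSI is an assumption here, as are the function name, the bin count, and the 0.25 alert threshold (a widely used rule of thumb, not a standard).

```python
import math
from typing import List, Sequence


def population_stability_index(
    expected: Sequence[float],
    actual: Sequence[float],
    bins: int = 10,
) -> float:
    """PSI between a baseline sample and a production sample of one feature or score."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 when the baseline is constant

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor keeps log() defined for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]          # scores seen at training time
shifted = [min(v + 0.3, 0.99) for v in baseline]  # production scores drifting upward

psi_same = population_stability_index(baseline, baseline)
psi_shifted = population_stability_index(baseline, shifted)

ALERT_THRESHOLD = 0.25  # assumed rule-of-thumb cutoff for significant drift
print(psi_same, psi_shifted > ALERT_THRESHOLD)
```

Identical distributions give a PSI of zero, while the shifted sample lands well above the assumed threshold, which is the kind of signal a dashboard would surface as a drift alert.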

Activate AI: Remember… all I’m offering is the truth. Nothing more.

Data and ML Platforms:

Existing investments in Data and ML platforms form the foundation for the Activate AI capabilities.

If you need to build your Data Platform from scratch, refer here for standing up a Data Mesh architecture on AWS: https://aws.amazon.com/blogs/architecture/lets-architect-architecting-a-data-mesh/

An illustrative MLOps Platform on AWS is shown below. If you need to build your ML platform from scratch, refer here for standing up MLOps capability on AWS: https://aws.amazon.com/blogs/machine-learning/mlops-foundation-roadmap-for-enterprises-with-amazon-sagemaker/

Extending Data and ML Platform to build Gen AI foundations:

Enabling generative AI capabilities for LOBs extends beyond models. This includes:

LLM Platform: Offers a choice of securely hosted models, the ability to customize them, and a testing factory that produces an approved list of LLMs for use within LOBs. Open source models and custom fine-tuned models created in the ML Platform feed into the LLM Platform.

Knowledge Platform: An approved knowledge platform supporting both lexical and semantic search over unstructured data, built using VectorOps from your data platform.

Tools Platform: Provides the tooling needed beyond models and knowledge bases, including tools for orchestration, memory, caching, integration, and ontology. The CDO’s scope is to ensure the best-fit tooling is available for LOBs to accelerate generative AI.
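
To show what combining lexical and semantic search in the Knowledge Platform can look like, here is a toy sketch of hybrid ranking. Everything in it is illustrative: the function names, the two-dimensional vectors, and the `alpha` weighting are assumptions, and in a real platform the vectors would come from an embedding model and the search would run in a vector store rather than in plain Python.

```python
import math
from typing import Dict, List, Tuple


def lexical_score(query: str, text: str) -> float:
    """Fraction of query terms that appear in the document text."""
    q_terms = set(query.lower().split())
    d_terms = set(text.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(
    query: str,
    query_vec: List[float],
    docs: Dict[str, Tuple[str, List[float]]],
    alpha: float = 0.5,
) -> List[str]:
    """Rank document IDs by alpha * semantic score + (1 - alpha) * lexical score."""
    scored = [
        (alpha * cosine(query_vec, vec) + (1 - alpha) * lexical_score(query, text), doc_id)
        for doc_id, (text, vec) in docs.items()
    ]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


# Hypothetical documents with hand-written stand-ins for embedding vectors.
docs = {
    "policy": ("travel expense policy for employees", [0.9, 0.1]),
    "menu": ("cafeteria lunch menu", [0.1, 0.9]),
}
ranking = hybrid_search("expense policy", [1.0, 0.0], docs)
print(ranking)
```

The `alpha` parameter is the design lever: closer to 1 favors meaning-based matches, closer to 0 favors exact keyword matches.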

Enabling Gen AI capabilities as Service

Building these platforms is great, but the key is providing them in a form that is secure, serverless, modular, scalable, and consumable by different personas.

Amazon Bedrock is built on exactly these principles.

The modular nature of this platform gives the CDO the standardization and flexibility required to keep up with the dynamic pace of Gen AI innovation.

  • Gen AI Capabilities: Amazon Bedrock provides the foundational platform capabilities to consume, fine-tune, deploy, and operationalize Gen AI, including models, knowledge bases, agents, orchestration, and customizations.
  • Secure by design: All models are hosted in an AWS environment. No data leaves the AWS environment, and data is neither stored nor used for retraining purposes.
  • Serverless: API driven, with no infrastructure to provision or scale. The undifferentiated heavy lifting of hosting models, knowledge bases, and orchestration is done by Amazon Bedrock.
  • No Vendor Lock-In: Bedrock provides choice across all capabilities, with first-party, partner, and open source models, support for vector databases (Pinecone, Redis, OpenSearch), extensions for LangChain…
  • Customizable: Bedrock enables fine-tuning models to adapt them to your data and domain.
  • Extensible: Integrates with the AWS ecosystem for foundational and value-added capabilities (storage, compute, AI services…) to keep up with the pace of Gen AI innovation.
  • Gen AI for All: Built for all personas, whether no-code, low-code, or high-code users, to infuse Gen AI into their applications.
  • Observability: Provides transparency and control through integration with CloudWatch, CloudTrail, Cost Explorer…
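
For the high-code persona, the "API driven" point above can be sketched as a single call through boto3's `bedrock-runtime` client. This is a hedged example rather than a reference implementation: the helper names `invoke_claude` and `make_client` are hypothetical, the model ID and region are assumptions that must match what is enabled in your account, and the request and response body shown is the Anthropic Claude text-completion format, which differs for other model providers and should be checked against the current Bedrock documentation.

```python
import json


def invoke_claude(
    client,
    prompt: str,
    model_id: str = "anthropic.claude-v2",  # assumed model ID; use one enabled in your account
    max_tokens: int = 300,
) -> str:
    """Send one prompt to a Claude text model through the Bedrock runtime API."""
    # Body format is model specific; this is the Claude text-completion shape.
    body = json.dumps({
        "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
        "max_tokens_to_sample": max_tokens,
    })
    response = client.invoke_model(
        modelId=model_id,
        contentType="application/json",
        accept="application/json",
        body=body,
    )
    # The response body is a stream of JSON bytes.
    return json.loads(response["body"].read())["completion"]


def make_client(region_name: str = "us-east-1"):
    """Create a Bedrock runtime client; needs boto3, AWS credentials, and model access."""
    import boto3  # imported lazily so the helper above can be tested without AWS

    return boto3.client("bedrock-runtime", region_name=region_name)
```

With credentials and model access in place, `invoke_claude(make_client(), "Summarize MLOps in one sentence.")` would return the model's completion. Because the client is passed in, the same helper can be exercised in tests with a stub client, and there is no infrastructure to provision on the caller's side.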

Accelerate AI: My CDO always said life was like a box of chocolates. You never know what you’re gonna need.

Amazon Bedrock provides the right platform for LOBs to accelerate their adoption of Gen AI. LOBs can now use the standardized patterns that fit their use case.

To learn more about the capabilities of Amazon Bedrock and attend deep-dive, hands-on workshops, get your re:Invent tickets as soon as possible here: https://reinvent.awsevents.com/
