Private LLM Development: HARMAN DTS' Journey in Healthcare
HARMAN Builds HealthGPT Trained on Clinical Datasets to Solve Customer Challenges
Building a large language model (LLM) and training it on enterprise datasets for business use is a complicated, expensive and time-consuming endeavor. When HARMAN Digital Transformation Solutions (DTS) undertook this challenge, it faced two choices: take a publicly available LLM such as ChatGPT, Bard or Cohere and refine its training, or develop a private LLM. It opted to build a private LLM.
HARMAN DTS caters to six verticals (healthcare, manufacturing, hospitality, retail, communications and software), and the company chose to focus on healthcare first.
"The vision for HealthGPT was to bring the power of private LLMs to end users because many are struggling with using the features of GPT with their enterprise data," said Dr. Jai Ganesh, chief product officer, DTS, HARMAN. "We want to empower and enable our enterprise customers and end users to run extremely complex queries and get insights from enterprise data in a safe and secure manner; at the same time, giving them control over cost and governance."
The team's first challenge was acquiring high-quality datasets to train a private LLM. It found clinical trial datasets on sites such as ClinicalTrials.gov, which provided valuable resources for the project.
"We wanted to build something on the private large language side and we chose an industry where high-quality data is available for us to go and train the model," Dr. Ganesh said. "We chose healthcare because we believe this is an industry that is primed for disruption with technology like GPT - if you're able to customize it."
Apart from data, the team confronted other challenges. Its customers in the U.S., EU and India demanded more control over the road map, privacy, compliance and security issues - all at optimized costs. "Healthcare is a regulated industry, and we have to ensure there is no PII hidden in the datasets. It is difficult to build an LLM that complies with industry regulations," Dr. Ganesh said. "This has to be done in cognizance with hospital chains, the pharma companies, or any other entities in the value chain."
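To make the PII concern concrete, the sketch below shows one simple kind of pre-training check: scanning each text record for patterns that look like personal identifiers. The article does not describe how HARMAN DTS screened its data, so the pattern set and the find_pii helper are hypothetical, and a regex scan of this kind would catch only the most obvious cases.

```python
import re

# Hypothetical patterns for illustration only; a real compliance
# pipeline would need far broader coverage than two regexes.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_pii(text):
    """Return the sorted names of the patterns that match anywhere in text."""
    return sorted(
        name for name, pattern in PII_PATTERNS.items() if pattern.search(text)
    )


flagged = find_pii("Contact jane.doe@example.org, SSN 123-45-6789")
clean = find_pii("Enrollment: 120 participants")
```

Records that return a non-empty list would be held back for review before they reach the training set.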
Training an Open-Source LLM
To address all these requirements, a team of over 500 members at HARMAN DTS in Bengaluru, India, opted to use an open-source model called Falcon 7B and to train it on publicly available clinical trial datasets from sites like ClinicalTrials.gov.
"A public LLM is trained on publicly available data, so it will not be able to answer highly specific questions posed by business users and customers. The workaround is to start using a public LLM or foundational model and train that with enterprise data; it's an extremely complex task," Dr. Ganesh said.
In contrast, a private LLM can be customized with enterprise data for bespoke business objectives and use cases, in this case, healthcare. But it has to be trained with high-quality data.
HARMAN DTS trained its LLM using a sample dataset comprising 100 to 200 queries, which were subsequently expanded to produce 250,000 training records using various techniques.
The team also created a narrower dataset of 50,000 records for breast cancer, which it validated and used to fine-tune Falcon 7B.
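The article does not say which techniques turned a few hundred seed queries into 250,000 records. One common approach is template expansion, in which each seed question is filled with every combination of a few variable values. The sketch below illustrates that idea only; the templates, phases and conditions are invented for the example and are not HARMAN's data.

```python
from itertools import product

# Hypothetical seed templates and slot values, not taken from the article.
SEED_TEMPLATES = [
    "Which {phase} trials for {condition} are currently recruiting?",
    "What are the primary outcomes of {phase} trials for {condition}?",
]
PHASES = ["Phase 1", "Phase 2", "Phase 3"]
CONDITIONS = ["breast cancer", "heart disease"]


def expand_seeds(templates, phases, conditions):
    """Fill every template with every (phase, condition) pair."""
    records = []
    for template, phase, condition in product(templates, phases, conditions):
        records.append(
            {
                "prompt": template.format(phase=phase, condition=condition),
                "condition": condition,
            }
        )
    return records


records = expand_seeds(SEED_TEMPLATES, PHASES, CONDITIONS)

# A narrower, single-condition subset, analogous to the breast cancer dataset.
breast_cancer_subset = [r for r in records if r["condition"] == "breast cancer"]
```

Two templates, three phases and two conditions yield 12 records here. The count grows multiplicatively with each added template or slot value, which is how a small seed set can reach a large training set.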
Enhancing LLM Precision
The DTS team, made up of data analysts, data engineers, business experts, visualization experts and even doctors, also faced challenges related to accuracy and model hallucinations.
"We observed that private LLMs have a lesser degree of hallucination and higher accuracy because they have been trained on global data but fine-tuned on limited data. That's not the case with public LLMs, which are trained on much bigger datasets," Dr. Ganesh said.
HARMAN's in-house team conducted hundreds of experiments on the model and continued refining it until its responses reached a 95% accuracy level. HealthGPT is now capable of addressing complex queries in three clinical trial areas: breast cancer, immune diseases and heart disease, Dr. Ganesh said.
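How the 95% figure was measured is not described in the article. A minimal sketch, assuming each model response is simply judged correct or incorrect by a reviewer, would compute accuracy as the share of correct responses and compare it with a target. The function names and the sample judgments below are hypothetical.

```python
def accuracy(judgments):
    """Return the fraction of responses that were judged correct."""
    if not judgments:
        raise ValueError("no judgments to score")
    return sum(1 for ok in judgments if ok) / len(judgments)


def meets_target(judgments, target=0.95):
    """Return True when accuracy reaches the target threshold."""
    return accuracy(judgments) >= target


# Invented example: 19 of 20 reviewed responses judged correct.
sample = [True] * 19 + [False]
score = accuracy(sample)
passed = meets_target(sample)
```

In an iterative setup like the one described, a check of this kind would be rerun after each experiment until the target is met.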
Private LLMs Beyond Healthcare
The team automated several steps in the process while building HealthGPT. After its success in the healthcare domain, the HARMAN DTS team is considering private LLMs for the five other verticals it serves.
"We believe the methodology and the tools we have built for accelerating the development can be extended to other industry verticals. So we'll be launching GPT equivalents for other industries soon," Dr. Ganesh said.
HARMAN's insights into the potential of private LLMs illustrate the evolving landscape of technology, particularly in regulated industries such as healthcare, and the ongoing quest for secure, cost-effective and insightful solutions. The DTS team also worked with the healthcare ecosystem to ensure compliance, built-in security and privacy by design.
This approach can serve as a valuable example for businesses seeking to build their own LLMs and train them on enterprise data.