Integrating LLMs in scientific workflows
Overview
Teaching: 30 min
Exercises: 40 min
Questions
How can LLMs and AI be used at all stages of science?
What techniques can I apply at each stage?
Objectives
Explore ways to integrate and automate LLMs in scientific workflows
Develop a repertoire of techniques
Scientific Workflows
Here, a scientific workflow means the process, within a science project, of transforming inputs into value-added, delivered outputs. This process might be partly automated. A typical project might consist of understanding the problem, researching existing literature, developing new ideas and methods, analysing existing data, performing experiments, assuring quality and delivering science.
LLMs are very general purpose and can help at many stages. Because LLMs are new, we may not immediately think to apply one. The diagram below maps example techniques onto each stage of a workflow.
graph LR
%% Main Workflow
subgraph cluster1 [Main Workflow]
A([Step 1 - Problem Understanding]) --> B([Step 2 - Literature Search]) --> C([Step 3 - Ideation]) --> D([Step 4 - Data Analysis]) --> E([Step 5 - Experiment]) --> F([Step 6 - Data Processing]) --> G([Step 7 - QA]) --> H([Step 8 - Publication]) --> I([Step 9 - Communication and Use])
%% Additional boxes with color coding
A --> A1[Ask for different perspectives]
A --> A2[Check your understanding providing the problem and your interpretation]
A --> A3[Get an overview of related fields]
A --> A4[Develop further questions for the client]
B --> B1[Get an overview of related fields]
B --> B2[Analyze publications with problem context]
B --> B3[Learn required new knowledge]
B --> B4[Summarize and organize]
C --> C1[Check your understanding providing the problem and your interpretation]
C --> C2[Learn required new knowledge]
C --> C3[Brainstorm ideas]
C --> C4[Look for flaws]
C --> C5[Facilitate different perspectives]
D --> D1[Build queries for data]
D --> D2[Write code for visualization]
D --> D3[Write code for statistics]
D --> D4[Write code for algorithms or models]
E --> E1[Manage experiments]
E --> E2[Capture provenance]
E --> E3[Design experiments]
E --> E4[Assist with approval]
F --> F1[Build queries for data]
F --> F2[Write code for visualization]
F --> F3[Write code for statistics]
F --> F4[Write code for algorithms or models]
G --> G1[Design QA metrics and goals]
G --> G2[Code effective human-in-the-loop QA]
G --> G3[Summarize and organize]
H --> H1[Set publication goals]
H --> H2[Progressively assess writing]
I --> I1[Build interactive visualization]
I --> I2[Integrate into downstream systems]
end
%% Overarching Concerns
subgraph cluster2 [Overarching Concerns]
direction TB
OC1[Accuracy and Verification]
OC2[Data Privacy]
OC3[Knowledge Cutoff]
OC4[Ethical Use]
end
%% Style definitions for color coding duplicated tasks
style A3 fill:#FFDDC1,stroke:#333,stroke-width:2px
style B1 fill:#FFDDC1,stroke:#333,stroke-width:2px
style B3 fill:#FFE6AA,stroke:#333,stroke-width:2px
style C2 fill:#FFE6AA,stroke:#333,stroke-width:2px
style D1 fill:#C1E1FF,stroke:#333,stroke-width:2px
style F1 fill:#C1E1FF,stroke:#333,stroke-width:2px
style D2 fill:#C1E1FF,stroke:#333,stroke-width:2px
style F2 fill:#C1E1FF,stroke:#333,stroke-width:2px
style G3 fill:#FFEBB7,stroke:#333,stroke-width:2px
style B4 fill:#FFEBB7,stroke:#333,stroke-width:2px
%% Style definitions for main workflow nodes
style A fill:#A9D18E,stroke:#38761D,stroke-width:2px
style B fill:#A9D18E,stroke:#38761D,stroke-width:2px
style C fill:#A9D18E,stroke:#38761D,stroke-width:2px
style D fill:#A9D18E,stroke:#38761D,stroke-width:2px
style E fill:#A9D18E,stroke:#38761D,stroke-width:2px
style F fill:#A9D18E,stroke:#38761D,stroke-width:2px
style G fill:#A9D18E,stroke:#38761D,stroke-width:2px
style H fill:#A9D18E,stroke:#38761D,stroke-width:2px
style I fill:#A9D18E,stroke:#38761D,stroke-width:2px
Throughout the process we can broadly categorize different techniques:
1. General Thought Enhancement and Problem Understanding
LLMs can help you understand and work with complex knowledge, and they typically have a broad ability to understand and advise. They can also be overly vanilla, that is, they may lack imagination. So while an LLM might help you think of things you hadn’t considered, its suggestions may be constrained. It’s important to critically assess the LLM’s reasoning and understanding, add your own perspectives, critique the LLM’s perspective and identify missing pieces. At the same time, LLMs are highly capable, and it is worth not assuming that you already have a comprehensive and complete understanding. LLMs can often help refine existing knowledge and provide missing knowledge.
Ask for different perspectives - LLMs can provide valuable critique and often illuminate ways of looking at problems and knowledge that might not be initially obvious. However, LLMs can easily miss context and may offer perspectives that are not relevant to it.
Check your understanding by providing the problem and your interpretation - A powerful technique for working with LLMs is to rephrase or recontextualise what you have learned in your own words, for example by producing an analogy, concrete use cases, implications or another synthesis of the knowledge. LLMs readily adapt to your interpretation and often give good feedback on whether it is correct.
Get an overview of related fields - LLMs have broad knowledge and can advise on interconnected fields; however, they may not be up to date with the latest research.
Develop further questions for the client - LLMs can be useful in facilitating conversations. They can help write and refine questions that clarify the science problem and assist in communicating with clients.
Analyze publications with problem context - There are many emerging tools for analyzing literature with LLMs, and they can greatly accelerate the rate at which publications can be used. The downside is that the full subtlety and context of a publication might be lost if LLMs are used for summarization or targeted analysis. There is also a risk that an LLM is biased and misconstrues or misses important information.
Summarize and organize - LLMs can be used to plan and summarize at many stages in the scientific process. They can rapidly harmonize disorganised information and, with humans in the loop reviewing the results, are a potential way to accelerate standardisation and metadata generation.
Learn required new knowledge - LLMs have good broad knowledge across a range of topics and can typically teach new skills effectively. Their knowledge of recent developments or esoteric fields may be limited or inaccurate.
Brainstorm ideas - LLMs can be a good conversational partner when brainstorming ideas. Beware that they are often overly positive and may need to be instructed to be more critical.
Look for flaws - When specifically instructed, LLMs may be able to find flaws in documentation, existing literature, code, and methods.
Facilitate different perspectives - LLMs can sometimes bring a new perspective when specifically instructed. They may be biased, though, and may not provide a comprehensive or diverse set of perspectives.
Set publication goals - LLMs can help set publication goals before writing begins.
Progressively assess writing - With predetermined publication goals, an LLM can progressively assess writing against them.
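Several of the techniques above come down to how the request is phrased. As a minimal sketch, the helper below assembles a "check my understanding" prompt from a problem statement and your own interpretation, and explicitly asks for criticism to counter an overly positive reply. The function name `build_understanding_check`, the example problem and the wording of the prompt are illustrative assumptions, not a prescribed template.

```python
def build_understanding_check(problem: str, interpretation: str) -> str:
    """Assemble a prompt asking an LLM to critique your reading of a problem.

    Hypothetical helper for illustration; adapt the wording to your context.
    """
    return "\n".join([
        "I am working on the following science problem:",
        problem.strip(),
        "",
        "Here is my interpretation in my own words:",
        interpretation.strip(),
        "",
        "Please be critical rather than encouraging:",
        "1. Point out anything in my interpretation that is wrong or unsupported.",
        "2. List important aspects of the problem I have not mentioned.",
        "3. Suggest two questions I should ask the client to fill the gaps.",
    ])


# Made-up example inputs, used only to show the shape of the prompt.
prompt = build_understanding_check(
    problem="Estimate how river temperature affects fish counts at our monitoring sites.",
    interpretation="Warmer water should reduce counts, so I plan to regress counts on temperature.",
)
print(prompt)
```

The printed text can be pasted into whichever LLM interface you use. Keeping the prompt in a function makes it easy to reuse the same critical framing across problems.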
2. Coding and Data Analysis
Build queries for data - LLMs can build queries, but caution is needed: make sure you understand the query the LLM has written so you can confirm it behaves as desired.
Write code for visualization - LLMs can speed up visualising data in different ways and are very good at using libraries like matplotlib in Python.
Write code for statistics - LLMs have a broad knowledge of statistical techniques and can assist in writing statistical analyses.
Write code for algorithms or models - LLMs can be useful for writing algorithms or models, but for novel work they generally must be carefully directed and assessed.
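One way to act on the advice to understand an LLM-written query is to run it against a tiny table whose answers you can work out by hand. The sketch below does this with Python's built-in `sqlite3` module. The table `readings`, its columns and the query itself are made up for illustration; the query simply stands in for one an LLM might produce.

```python
import sqlite3

# Hypothetical query, standing in for one an LLM might produce.
LLM_QUERY = """
SELECT site, AVG(temp_c) AS mean_temp
FROM readings
WHERE temp_c IS NOT NULL
GROUP BY site
ORDER BY site
"""


def run_on_fixture(query: str) -> list:
    """Run a query against a tiny in-memory table with known answers."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (site TEXT, temp_c REAL)")
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?)",
        [("A", 10.0), ("A", 14.0), ("B", 20.0), ("B", None)],
    )
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows


result = run_on_fixture(LLM_QUERY)
# Expected values worked out by hand: site A averages 12.0,
# and site B has one valid reading of 20.0 plus one missing value.
expected = [("A", 12.0), ("B", 20.0)]
assert result == expected, f"query returned {result}, expected {expected}"
print("query matches the hand-checked fixture")
```

A fixture this small will not catch every problem, but it forces you to state what the query should return, including edge cases such as missing values, before running it on real data.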
3. Data Management, Experiment Design, and Provenance
Manage experiments - LLMs may be able to help manage experiment planning and execution together with a human.
Capture provenance - LLMs can extract and formalize provenance information.
Design experiments - LLMs may be able to help design and formalize experiments.
Assist with approval - LLMs might be able to assist in writing proposals and in understanding and writing approvals.
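If an LLM is asked to extract provenance as structured text, its reply still needs checking before it is stored. The following is a minimal sketch that assumes the LLM was asked to reply in JSON. The field names in `REQUIRED_FIELDS` are hypothetical and not taken from any provenance standard, and the `reply` string is a hand-written stand-in for a model response.

```python
import json

# Hypothetical required fields for a provenance record; choose your own.
REQUIRED_FIELDS = {
    "input_files": list,
    "software": str,
    "parameters": dict,
    "run_date": str,
}


def check_provenance(llm_output: str) -> list:
    """Return a list of problems found in a JSON provenance record."""
    try:
        record = json.loads(llm_output)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err.msg}"]
    if not isinstance(record, dict):
        return ["top-level value is not a JSON object"]
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    return problems


# Example text standing in for an LLM's reply; "run_date" is deliberately absent.
reply = '{"input_files": ["sites.csv"], "software": "analysis.py", "parameters": {"alpha": 0.05}}'
problems = check_provenance(reply)
print(problems)
```

A check like this only confirms the shape of the record. Whether the extracted values are true still needs a human, or a comparison against the actual files and logs.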
4. Data Quality Assurance (QA) and Evaluation
Design QA metrics and goals - LLMs can help establish and document metrics and goals for QA of data, processes, and results.
Code effective human-in-the-loop QA - Humans are critical in many QA processes, but they can work together with LLMs to rapidly write and execute tests.
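As a rough illustration of human-in-the-loop QA, the sketch below runs automatic checks over records and routes anything that fails to a review list for a person to inspect, rather than silently dropping or fixing it. The function `triage`, the check names and the thresholds are assumptions made for this example; in practice an LLM might help draft such checks, and a human would approve them.

```python
def triage(records, checks):
    """Split records into those passing every check and those needing human review.

    `checks` maps a check name to a function returning True when a record is fine.
    """
    passed, needs_review = [], []
    for record in records:
        failed = [name for name, check in checks.items() if not check(record)]
        if failed:
            needs_review.append({"record": record, "failed": failed})
        else:
            passed.append(record)
    return passed, needs_review


# Hypothetical checks with made-up thresholds.
checks = {
    "temperature_in_range": lambda r: -5.0 <= r["temp_c"] <= 40.0,
    "site_present": lambda r: bool(r["site"]),
}

records = [
    {"site": "A", "temp_c": 12.5},
    {"site": "", "temp_c": 13.0},
    {"site": "B", "temp_c": 95.0},
]

passed, needs_review = triage(records, checks)
print(len(passed), "passed;", len(needs_review), "sent for human review")
for item in needs_review:
    print(item["failed"], item["record"])
```

Recording which check failed alongside each record gives the reviewer the reason for the flag, which keeps the human step quick.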
5. Communication and Integration
Build interactive visualizations - LLMs can rapidly build things like an interactive website.
Integrate into downstream systems - Downstream systems might require a different data format. An LLM can help write code to transform data for different integrations, or may be able to directly transform small amounts of data.
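The format conversion mentioned above is typical of the small, checkable code an LLM can help write. As a sketch, the function below converts CSV text into a JSON array using only the Python standard library, followed by a round-trip check of the result. The column names and values are invented for the example.

```python
import csv
import io
import json


def csv_to_json_records(csv_text: str) -> str:
    """Convert CSV text with a header row into a JSON array of objects.

    All values stay as strings; type conversion is left to the caller.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, indent=2)


# Made-up data for illustration.
csv_text = "site,temp_c\nA,12.5\nB,20.0\n"
json_text = csv_to_json_records(csv_text)
print(json_text)

# Round-trip check: the JSON should hold the same rows as the CSV.
assert json.loads(json_text) == [
    {"site": "A", "temp_c": "12.5"},
    {"site": "B", "temp_c": "20.0"},
]
```

Writing the transformation as code, rather than asking the LLM to convert the data directly, means the same conversion can be rerun and tested as the data grows.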
Sketch your science workflow
Sketch out a simple science workflow in your context
Note points where an LLM could be useful
Enhance your diagram by connecting techniques for working with LLMs to appropriate steps in your workflow
Write an example query for a workflow step
For some of the enhancements you’ve identified, write an example query for the LLM and note the response. Is it useful?
Note: GPT-4 and o1-preview were used to review the material and to iteratively generate and refine the diagram, and contributed most of the overarching concerns.
Key Points
LLMs can be integrated at many points in a scientific workflow
Techniques include using LLMs for literature review, data understanding, ideation, coding, quality assurance, and writing