🏛️ NYTimes Secures Access To ChatGPT Logs In Lawsuit Against OpenAI
Legal clash lets NY Times access ChatGPT logs, DeepSeek discount drags latency per SemiAnalysis, AI wrote 13% of 2024 biomed papers, Grok 4 scores leak, free Gemma 3n finetuning guide
Read time: 10 min
📚 Browse past editions here.
(I publish this newsletter daily. Noise-free, actionable, applied-AI developments only.)
⚡In today’s Edition (4-July-2025):
🏛️ NY Times wins right to see ChatGPT logs in legal fight with OpenAI
🏆 SemiAnalysis’s new report: Latency became the hidden price of DeepSeek’s discount
🗞️ Top of Github: jsoncrack
🗞️ Byte-Size Briefs:
New research report finds 13% of biomedical papers used AI for writing in 2024.
A leaked xAI Grok 4 benchmark chart surfaced across X and Reddit.
🧑🎓 Tutorial: Fine-tune Gemma 3n for free with Unsloth
🏛️ NY Times wins right to see ChatGPT logs in legal fight with OpenAI
NYT can even search deleted ChatGPT logs 🔍📂, exposing up to 2B private chats and testing OpenAI's privacy safeguards. Judge Sidney Stein rejected OpenAI’s plea to keep standard deletion policies.
Magistrate Ona Wang’s preservation order forces the company to store every non-enterprise chat indefinitely while it negotiates keyword scopes with NYT, Daily News, and CIR.
NYT justified this broad sweep by suggesting that people who use ChatGPT to bypass its paywalls may delete their history after doing so. The newspaper also claims that searching those logs may prove to be the crux of the whole suit: that OpenAI's large language models (LLMs) are not only trained on its copyrighted material but can also plagiarize that material.
Only small, anonymized slices will stay on OpenAI servers, yet they still expose prompts, outputs, and timestamps.
So billions of medical, job, and relationship details now sit in limbo, with leak risk growing as multiple firms handle them. The ruling shields enterprise clients while ordinary users lose deletion rights and voice. The order does specifically note that logs from ChatGPT Enterprise and ChatGPT Edu, its custom offering for colleges and universities, will be exempt.
Some observers see leverage that might push OpenAI toward settlement or drive users to rivals like Claude and Gemini, reshaping competition.
OpenAI stores massive archives that the Times and other media can’t realistically comb through. They’ll settle for a limited keyword filter that stays on OpenAI’s systems, scrubbed of identities and never handed over. Both camps are still ironing out the rules and want the logs kept only briefly.
🏆 SemiAnalysis’s new report: Latency became the hidden price of DeepSeek’s discount
DeepSeek kept its tokens cheap by making users wait and by limiting context memory, and that pushed many to faster third-party hosts. That strategy comes from tokenomics math where price, latency, and context are three knobs, and by choosing high latency and 64K context they cut GPU cost but hurt experience.
DeepSeek most likely does this deliberately, saving scarce compute for research and the pursuit of AGI rather than for running a popular chat service.
The result is falling market share for their own portal while their open models thrive elsewhere, proving that the real cost of a token hides in how long you wait for it and how much context comes with it.
DeepSeek grabbed attention by offering reasoning quality like o1 at only $0.55 input and $2.19 output, but users later left because the service stayed slow and cramped while rivals sped up and widened context.
LLM providers struggle because cheap tokens usually require painful waits, tiny working memory, or both.
DeepSeek’s answer is to batch many conversations on limited GPUs, throttle reply speed, and keep a 64K context so it can train toward AGI while outsiders host faster copies of its model.
🚀 Right after launch, DeepSeek’s own chat and API traffic spiked, but tracking firms now show visits down even as OpenAI, Google, and others rise, meaning curiosity turned into churn once early excitement met sluggish service.
📊 Tokenomics turns every reply into a factory output where latency, interactivity, and context size are knobs that directly set cost per million tokens, so price alone never tells the whole story (a rough numeric sketch follows this list).
⏳ DeepSeek keeps price low mainly by forcing users to wait well over 25 s before the first word while competitors like Parasail charge roughly $2–$4 and start almost immediately, showing that dollars saved equal seconds lost.
🧠 The model’s 64K context is among the smallest at the top tier, which blocks large‑code and long‑document cases that need steady memory, whereas hosts like Lambda sell the same model with over 2.5× more room for past tokens.
🖥️ Heavy batching on a single GPU cluster cuts cost per token but raises median end‑to‑end latency per user, so DeepSeek sacrifices experience to reserve scarce accelerators for internal research in a compute‑constrained environment shaped by export controls.
🤝 Anthropic faces similar GPU pressure, yet Claude feels quicker than DeepSeek because it outputs fewer tokens for the same task, proving that concise answers can mask slower raw speed and that intelligence per token now matters.
🌐 As wrappers like Cursor and Perplexity explode, third‑party clouds host DeepSeek R1 and V3 with sub‑5 s latency, so the model’s reach grows even while DeepSeek’s own portal fades, underscoring that the battle shifts to inference clouds instead of single‑vendor apps.
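Here is that rough numeric sketch of the tokenomics trade-off. Every number in it (GPU rental price, throughput, saturation point) is a made-up assumption chosen only to show the shape of the curve, not a figure from the SemiAnalysis report:

```python
# Toy tokenomics model: how batching trades cost per token against per-user latency.
# Every number here is an illustrative assumption, not a figure from the report.

GPU_COST_PER_HOUR = 2.0      # assumed GPU rental price, USD/hour
PEAK_TOKENS_PER_SEC = 2000   # assumed max decode throughput of one GPU
HALF_SATURATION_BATCH = 32   # assumed batch size at which throughput hits half of peak

def cost_and_latency(batch_size: int, reply_tokens: int = 500):
    """Return (USD per 1M output tokens, seconds to finish one user's reply)."""
    # Throughput rises with batch size but saturates (crude Michaelis-Menten curve).
    effective_tps = PEAK_TOKENS_PER_SEC * batch_size / (batch_size + HALF_SATURATION_BATCH)
    # Cost per token: GPU dollars spread over every token it emits per hour.
    usd_per_million = GPU_COST_PER_HOUR / (effective_tps * 3600) * 1_000_000
    # Each user only gets their slice of the shared throughput.
    per_user_tps = effective_tps / batch_size
    latency_s = reply_tokens / per_user_tps
    return usd_per_million, latency_s

for bs in (4, 16, 64, 256):
    cost, latency = cost_and_latency(bs)
    print(f"batch={bs:>3}  ~${cost:.2f}/M tokens  ~{latency:.0f}s per 500-token reply")
```

With these toy numbers, cost per million tokens falls steadily as the batch grows while each user’s wait for a 500-token reply stretches from under 10 seconds to over a minute, which is exactly the knob the report argues DeepSeek turned all the way toward cheapness.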
🗞️ Top of Github: jsoncrack
Visualize, query, convert, and validate JSON instantly through interactive graphs; transform JSON, YAML, CSV, XML, and TOML; generate schemas, code, and mock data; execute jq queries; export PNG/SVG; all with client-side, local processing.
Paste JSON, watch a zoomable map pop, grab CSV and go.
See, edit, and share data structures visually while auto-producing schema and interfaces without leaving your browser.
Turn tangled JSON into clickable graphs, tidy formats, and ready code, fully offline, lightning fast.
🗞️ Byte-Size Briefs
13% of biomedical papers used AI for writing in 2024.
🧬 The team scanned 15M PubMed abstracts and shows LLMs shaped ≥13.5% of recent biomedical papers, doubling the vocabulary jolt seen during COVID and marking an unprecedented shift in scientific writing.
Frequency gaps varied widely. Computational biology, several East-Asian affiliations, and fast-turnaround publishers displayed the steepest rises, whereas prestige journals showed much smaller shifts. The lexical upheaval eclipses past topic-driven swings, showing that LLM tools are rapidly reshaping scientific prose and that naive detection only scratches the surface of their true reach.
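For intuition, the method is closer to counting “excess” vocabulary than to running an AI detector. Here is a toy sketch with a hand-picked marker-word list and two tiny made-up corpora; the actual study derives its word list statistically from 15M abstracts rather than hard-coding it:

```python
import re

# Words often cited as tell-tale LLM style markers; this is an illustrative,
# hand-picked list, not the study's statistically derived vocabulary set.
MARKER_WORDS = {"delve", "pivotal", "intricate", "underscore", "showcase"}

def marker_rate(abstracts: list[str]) -> float:
    """Fraction of abstracts containing at least one marker word."""
    hits = 0
    for text in abstracts:
        tokens = set(re.findall(r"[a-z]+", text.lower()))
        hits += bool(tokens & MARKER_WORDS)
    return hits / len(abstracts)

# Hypothetical mini-corpora standing in for pre- and post-ChatGPT abstracts.
abstracts_2021 = ["We measured protein expression in mouse liver tissue.",
                  "Results underscore a modest effect of the treatment."]
abstracts_2024 = ["We delve into the intricate interplay of pivotal signaling pathways.",
                  "Our findings showcase and underscore a robust dose response."]

baseline = marker_rate(abstracts_2021)
current = marker_rate(abstracts_2024)
# The 'excess' usage over the historical baseline is what hints at LLM involvement.
print(f"baseline: {baseline:.0%}, 2024: {current:.0%}, excess: {current - baseline:+.0%}")
```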
A leaked xAI Grok 4 benchmark chart surfaced across X and Reddit, showing Grok 4 scoring 45% on HLE (Humanity's Last Exam).
Even OpenAI’s “Deep Research” agent, which drew headlines earlier this year, reached only 26.6% when it combined reasoning with web search.
It (HLE) holds 2,500 expert-written questions spanning more than 100 subjects, including math, physics, computer science and humanities, and 14% of them mix text with images. The authors deliberately built in anti-gaming safeguards and hid a private question set so that simply memorising answers will not help a model.
🧑🎓 Tutorial: Fine-tune Gemma 3n for free with Unsloth
Google’s Gemma 3n multimodal model handles image, audio, video, and text inputs. Available in 2B and 4B effective sizes (E2B and E4B), it supports 140 languages for text and multimodal tasks. You can now run and fine-tune Gemma-3n-E4B and E2B locally using Unsloth.
Quick summary: The result is roughly 1.5× more training steps per second, about 50% lower VRAM during training, and the ability to hold sequences 4× longer than the same model trained with the usual Hugging Face + Flash-Attention 2 stack.
Why training becomes faster
Gemma 3n keeps the standard transformer math, but Unsloth rewrites the kernels in Triton, fusing layer-norm, attention, and MLP ops so they run as one GPU kernel instead of several, which removes a lot of launch overhead and redundant memory traffic. Unsloth also pre-computes the LoRA weight updates on the CPU and streams them once per step, trimming GPU work even further. On top of that, the framework autocasts only the risky convolutional weights to float32 right before multiplication, avoiding the full float32 copy that other toolchains need when they stumble over the NaN problem in Gemma’s vision blocks. Together, these shortcuts let each training step finish in about two-thirds of the time that a stock Transformers pipeline needs, hence the advertised 1.5× speed-up.
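Kernel fusion itself is easy to see in miniature. The sketch below fuses a residual add and a ReLU into a single Triton kernel so the intermediate sum never round-trips through GPU memory; it is a generic illustration of the idea, not Unsloth’s actual fused layer-norm/attention/MLP kernels.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    # One program instance handles one BLOCK-sized chunk of the tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # Add + activation in one pass: the intermediate (x + y) stays in registers,
    # so there is one kernel launch and one store instead of two of each.
    tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)

def fused_add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK"]),)
    fused_add_relu_kernel[grid](x, y, out, n, BLOCK=1024)
    return out

# Requires a CUDA GPU; the fused result matches the two-kernel PyTorch version.
x = torch.randn(1 << 20, device="cuda")
y = torch.randn(1 << 20, device="cuda")
assert torch.allclose(fused_add_relu(x, y), torch.relu(x + y))
```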
Why memory use drops by roughly 50%
Most of the saving comes from Unsloth’s “Dynamic 4-bit” quantization. Instead of quantizing every matrix, the algorithm skips the handful of layers whose sensitivity would otherwise hurt accuracy, but still stores the rest in NF4 format; this keeps accuracy near fp16 while cutting parameter memory roughly in half and adding only about 10% VRAM over plain Bits-and-Bytes 4-bit. Gemma 3n itself helps too: its Per-Layer Embedding cache lets the huge token embeddings sit in CPU RAM or disk, so only the active slice stays on the GPU during training. Finally, Unsloth’s gradient-checkpointing system checkpoints after every attention block, then recomputes intermediate activations on demand; this trades a tiny amount of compute for another 30% cut in activation memory, which is what allows the entire 4B model to fit inside 12 GB on a Tesla T4.
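Conceptually, “dynamic” 4-bit looks like standard NF4 loading plus a skip-list for the sensitive layers. Here is a rough sketch using Hugging Face’s BitsAndBytesConfig; the module names in the skip list and the model/loader class are illustrative placeholders, and Unsloth’s own layer selection is automatic rather than hand-written like this:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit loading, with a few accuracy-critical modules left un-quantized.
# Module names below are placeholders, not Gemma 3n's real layer names.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 keeps accuracy close to fp16
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls still run in 16-bit
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
    llm_int8_skip_modules=["vision_tower", "lm_head"],  # skip sensitive layers
)

# Model id and auto class are shown for illustration; check the model card
# for the correct loader (the multimodal checkpoint may need a different class).
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3n-E4B-it",
    quantization_config=bnb_config,
    device_map="auto",
)
```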
How the context window quadruples
Gemma 3n is trained with rotary-position interpolation and a 32K-token setting, but with Hugging Face + Flash-Attention 2 the KV cache still blows up and forces practitioners to stop around 8K tokens. Unsloth tackles this on two fronts. First, its long-context gradient-checkpointing discards KV blocks from local attention layers that will never be reused, keeping only the global-layer cache that is actually needed for 32K tokens and beyond. Second, Gemma’s own architecture has a 5-local-to-1-global attention ratio, which already shrinks the number of keys and values that must be stored for long sequences; Google designed this change explicitly to control KV memory at 128K tokens. The combination lets practitioners hold 32K tokens on commodity 24 GB GPUs and scale up to even longer windows on larger cards, amounting to the claimed 4× jump over the common 8K baseline.
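Back-of-the-envelope arithmetic shows why the 5:1 local-to-global ratio matters so much for the KV cache. The layer count, head count, head dimension, and window size below are assumed round numbers chosen to illustrate the scaling, not Gemma 3n’s published configuration:

```python
# Rough KV-cache size estimate: global-attention layers cache every past token,
# local (sliding-window) layers only cache the window. All config numbers are
# illustrative assumptions, not Gemma 3n's actual architecture values.

def kv_cache_gb(seq_len, n_layers=30, n_kv_heads=8, head_dim=128,
                window=512, local_to_global=5, bytes_per_value=2):
    global_layers = n_layers // (local_to_global + 1)        # e.g. 1 in every 6
    local_layers = n_layers - global_layers
    per_token = 2 * n_kv_heads * head_dim * bytes_per_value  # keys + values, one layer
    full = global_layers * seq_len * per_token                # global: whole history
    local = local_layers * min(seq_len, window) * per_token   # local: window only
    return (full + local) / 1e9

for tokens in (8_000, 32_000, 128_000):
    mixed = kv_cache_gb(tokens)
    all_global = kv_cache_gb(tokens, local_to_global=0, window=10**9)
    print(f"{tokens:>7} tokens: ~{mixed:.2f} GB with 5:1 local/global "
          f"vs ~{all_global:.2f} GB if every layer were global")
```

Even with made-up sizes, the mixed layout keeps the 128K-token cache a small fraction of what an all-global stack would need, which is the headroom the KV-aware checkpointing then exploits.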
Putting the pieces together
Benchmark figures in the Unsloth documentation show Gemma 3n-E4B training with 1.5× higher throughput, using a little under 12 GB on a T4 when dynamic 4-bit is enabled, and handling the full 32K context without out-of-memory errors. Those numbers line up with the theoretical savings from the kernel fusion, selective quantization, embedding caching, and KV-aware checkpointing described above. Because every optimisation is applied at compile time or via lightweight hooks, users interact with the same high-level API calls they would use in vanilla Transformers, but obtain materially better hardware utilisation.
If you want to reproduce the gains
Run the Unsloth Colab for “gemma-3n-E4B” and keep load_in_4bit=True, gradient_checkpointing=True, max_seq_length=32768, and unsloth.compile(model) in the script. These flags activate dynamic 4-bit, long-context checkpointing, and the Triton compiler passes discussed here. On a free Colab T4 you will see the memory footprint hover around 11–12 GB and steps per second beat the stock notebook by the advertised margin.
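Put together, the setup looks roughly like the sketch below. It assumes the API shape used in Unsloth’s Gemma notebooks (FastModel.from_pretrained plus get_peft_model); treat it as a sketch and check the official Colab for the exact, current calls, since names and defaults may differ.

```python
from unsloth import FastModel  # assumed import, as used in Unsloth's Gemma notebooks

# Load Gemma 3n in dynamic 4-bit with a long context window.
model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-3n-E4B-it",  # assumed model id from Unsloth's hub
    max_seq_length=32768,                  # the 4x-longer context discussed above
    load_in_4bit=True,                     # dynamic 4-bit / NF4 weights
)

# Attach LoRA adapters; the "unsloth" gradient-checkpointing mode enables the
# long-context, KV-aware activation recomputation described earlier.
model = FastModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    use_gradient_checkpointing="unsloth",
)

# From here, train with TRL's SFTTrainer exactly as in the Colab notebook.
```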
Throughout these changes Unsloth keeps the model numerically identical in forward pass, so prediction quality remains the same while cost per experiment drops sharply.
That’s a wrap for today, see you all tomorrow.